Mayors in Major US Cities

Introduction

With the runoff mayoral election looming, Chicago is set to make history. The city will become the largest American city ever to elect a female person of color as mayor. I was curious at how common it is for women to be mayors, as well as how often people of color govern cities in today’s political landscape. Unfortunately, I could not find an aggregated source of information on the ethnicites and genders of mayors. What I did find is this list on Ballotpedia, which includes links to easily-scraped data on the mayors of the largest 100 cities in United States. I also gathered information specific to each city, such as demographics and economics indicators like unemployment and median income. I built the scraper in Python, which you can view here.

Below I analyze this dataset using Python for data processing and some initial analysis, and R for building plots for further exploratory analysis. Interesting differences arise between the cities that elect women and those that elect people of color. Cities that elect women are less African-American, and on average are smaller. Cities that elect people of color as mayor, on the other hand, have smaller fiscal budgets than cities without a person of color as mayor. To assess how differences in city demographics and other characteristics interact and influence the ethnicity and sex of elected mayors, I also perform regression analysis using R.

Process Data

First we read in the dataset. Below is a full list of the columns and the variable types. There is a mix of information specific to the office (i.e. term_ends, took_office, gov_type), data specific to the city (i.e. median_income, unemployment and the ethnicity variables starting in perc_), and variables specific to individual mayors (mayor_gender_f_1, mayor_race_nonwhite_1).

import pandas as pd
import os
import re
df = pd.ExcelFile("./mayors/mayors_data.xlsx").parse("Sheet1")
print(df.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 100 entries, 0 to 99
## Data columns (total 39 columns):
## Unnamed: 0                    100 non-null int64
## city_name                     100 non-null object
## city_url                      100 non-null object
## gov_type                      100 non-null object
## mayor_name                    100 non-null object
## mayor_url                     100 non-null object
## term_ends                     100 non-null int64
## took_office                   100 non-null int64
## 2013_population               100 non-null int64
## city_council_seats            100 non-null int64
## city_seal                     100 non-null object
## college_grad_rate             100 non-null float64
## ethnicity                     93 non-null float64
## fiscal_year_budget            100 non-null object
## gender                        100 non-null float64
## hs_grad_rate                  100 non-null float64
## last_city_council_election    100 non-null object
## last_mayoral_election         100 non-null object
## mayor_party                   87 non-null object
## mayor_pic                     99 non-null object
## median_income                 100 non-null int64
## next_city_council_election    100 non-null object
## next_mayoral_election         100 non-null object
## race                          100 non-null object
## total_budget                  100 non-null object
## unemployment                  99 non-null float64
## budget                        100 non-null int64
## race_json                     100 non-null object
## perc_asian                    100 non-null float64
## perc_hispanic                 1 non-null float64
## perc_white                    100 non-null float64
## perc_white_not_hispanic       12 non-null float64
## perc_african_american         100 non-null float64
## perc_native_american          100 non-null float64
## perc_pacific_islander         100 non-null float64
## perc_two_or_more              100 non-null float64
## mayor_gender_f_1              100 non-null int64
## mayor_race_nonwhite_1         100 non-null int64
## mayor_race                    100 non-null object
## dtypes: float64(13), int64(9), object(17)
## memory usage: 30.5+ KB
## None

In the scraper that I linked to above, we already performed a great deal of the nitty-gritty data processing. However, there are a few tweaks to the dataset we need to make before any analysis. The first is retrieving the party from the name of the mayor. For whatever reason, the party information is lacking on each mayor’s individual page on Ballotpedia, but the mayor’s party exists attached the mayor_name column in the dataset as shown below.

print(df.mayor_name.head())
## 0      Bill de Blasio (D)
## 1       Eric Garcetti (D)
## 2        Rahm Emanuel (D)
## 3    Sylvester Turner (D)
## 4        James Kenney (D)
## Name: mayor_name, dtype: object

Using regular expressions, I create two simple functions to extract each mayor’s party affiliation, and clean up the name column. I use the handy map function in the pandas package to create the new party variable. Below are the cleaned up columns:

#retrieve party
def find_party(x):
  try:
    return(re.search('\((.*?)\)', x).group(1))
  except:
    return(None)
df["party"] = df.mayor_name.map(find_party)
#clean up mayor name
def trim_name(x):
  try:
    return(re.sub(r"\((.*?)\)", "",x))
  except:
    return(x)
df["mayor_name"] = df["mayor_name"].map(trim_name)
print(df[["mayor_name","party"]].tail())
##              mayor_name party
## 95            Lily Mei      I
## 96        David Bieter      D
## 97       John Valdivia      I
## 98  Randall L. Woodfin      D
## 99        Levar Stoney      D

There is still some missing party information for a few mayors, but after a quick search on Wikipedia I was able to track down the missing data. I make manual edits below.

#manual party edits
print(df[["mayor_name","party"]][df.party.isnull()])
##            mayor_name party
## 23       David Briley  None
## 68  Harry LaRosiliere  None
## 90       Jenn Daniels  None
df.loc[df["mayor_name"] == "David Briley","party"] = "D"
df.loc[df["mayor_name"] == "Harry LaRosiliere","party"] = "R"
df.loc[df["mayor_name"] == "Jenn Daniels","party"] = "I"

Descriptive Stats in Python

To begin the analysis, let’s start by getting some basic value counts. Below we see that there are many more Democratic mayors in the 100 largest cities in the U.S. than there are Republican and independent mayors.

print(df['party'].value_counts())
## D    61
## R    25
## I    14
## Name: party, dtype: int64

We now look at the number of offices by government type. The government type defines the power structure and the way that responsibility is allocated between the city council and the mayor. For a good description on the differences between each type, read the Ballotpedia page here.

print(df['gov_type'].value_counts())
## Strong mayor       46
## Council-manager    46
## Hybrid              6
## Strong mayor*       1
## City commission     1
## Name: gov_type, dtype: int64

Below is the number of mayors that are male and female in the top 100 cities. I also print the cities that have female mayors. We see that the majority of mayors are men.

print(df['mayor_gender_f_1'].value_counts())
## 0    77
## 1    23
## Name: mayor_gender_f_1, dtype: int64
print(df[["mayor_name","city_name"]][df.mayor_gender_f_1==1])
##                    mayor_name                   city_name
## 5            Thelda Williams             Phoenix, Arizona
## 13              London Breed    San Francisco, California
## 15                  Vi Lyles    Charlotte, North Carolina
## 16               Betsy Price            Fort Worth, Texas
## 20              Jenny Durkan          Seattle, Washington
## 24            Catherine Pugh          Baltimore, Maryland
## 28           Carolyn Goodman            Las Vegas, Nevada
## 38            Keisha Bottoms             Atlanta, Georgia
## 40             Jean Stothert              Omaha, Nebraska
## 41           Nancy McFarlane      Raleigh, North Carolina
## 43  Elizabeth "Libby" Schaaf          Oakland, California
## 49           LaToya Cantrell       New Orleans, Louisiana
## 50                 Karen Goh      Bakersfield, California
## 56              Lyda Krewson          St. Louis, Missouri
## 59              Linda Gorton          Lexington, Kentucky
## 66          Nancy B. Vaughan   Greensboro, North Carolina
## 69               Debra March            Henderson, Nevada
## 73                Mary Salas      Chula Vista, California
## 86       Lori Barnett Dodson               Garland, Texas
## 88           Hillary Schieve                 Reno, Nevada
## 90               Jenn Daniels            Gilbert, Arizona
## 91      Sharon Weston Broome       Baton Rouge, Louisiana
## 95                  Lily Mei          Fremont, California

Below is the number of mayors that are people of color in the top 100 cities. Please note that the information on sex and ethnicity did not exist in the original dataset pulled from Ballotpedia. I manually entered this data by looking at each mayor’s Wikipedia page. Race (and gender) are subjective terms, so this part of the dataset is subject to my bias. Nonetheless, we see that majority of mayors are white.

print(df['mayor_race_nonwhite_1'].value_counts())
## 0    78
## 1    22
## Name: mayor_race_nonwhite_1, dtype: int64
print(df[["mayor_name","city_name","mayor_race"]][df.mayor_race_nonwhite_1==1])
##                mayor_name                  city_name              mayor_race
## 3       Sylvester Turner              Houston, Texas                   black
## 13          London Breed   San Francisco, California                   black
## 15              Vi Lyles   Charlotte, North Carolina                   black
## 21    Michael B. Hancock            Denver, Colorado                   black
## 24        Catherine Pugh         Baltimore, Maryland                   black
## 35             Sly James       Kansas City, Missouri                   black
## 38        Keisha Bottoms            Atlanta, Georgia                   black
## 42        Francis Suarez              Miami, Florida                hispanic
## 49       LaToya Cantrell      New Orleans, Louisiana                   black
## 54           Harry Sidhu         Anaheim, California       indian/asian/sikh
## 55         Miguel Pulido       Santa Ana, California                hispanic
## 62         Michael Tubbs        Stockton, California                   black
## 64     Melvin Carter III         St. Paul, Minnesota                   black
## 67         Ras J. Baraka          Newark, New Jersey                   black
## 68      Harry LaRosiliere               Plano, Texas                   black
## 71           Byron Brown           Buffalo, New York                   black
## 73            Mary Salas     Chula Vista, California                hispanic
## 79     Kenneth Alexander           Norfolk, Virginia                   black
## 91  Sharon Weston Broome      Baton Rouge, Louisiana                   black
## 95              Lily Mei         Fremont, California  asian/pacific islander
## 98    Randall L. Woodfin         Birmingham, Alabama                   black
## 99          Levar Stoney          Richmond, Virginia                   black

Finally, since Chicago is set to be the largest city ever in the U.S. to ever have a female person of color as mayor, I print the 8 cities where this condition is met. 3 of these cities are in California, and 2 are in Louisiana.

print(df[["mayor_name","city_name"]][(df.mayor_race_nonwhite_1==1)&(df.mayor_gender_f_1==1)])
##                mayor_name                  city_name
## 13          London Breed   San Francisco, California
## 15              Vi Lyles   Charlotte, North Carolina
## 24        Catherine Pugh         Baltimore, Maryland
## 38        Keisha Bottoms            Atlanta, Georgia
## 49       LaToya Cantrell      New Orleans, Louisiana
## 73            Mary Salas     Chula Vista, California
## 91  Sharon Weston Broome      Baton Rouge, Louisiana
## 95              Lily Mei         Fremont, California

Most of the interesting features are numeric. Before we transition to R to build plots, below we print the distribution of all the numeric variables in the dataset.

numerics = ['int64','float64']
print(df.select_dtypes(include=numerics).iloc[:,1:5].describe())
##          term_ends  took_office  2013_population  city_council_seats
## count   100.000000   100.000000     1.000000e+02          100.000000
## mean   2020.390000  2013.880000     6.165657e+05           11.020000
## std       1.109145     4.297709     9.481099e+05            7.774577
## min    2019.000000  1994.000000     2.078780e+05            5.000000
## 25%    2019.000000  2012.750000     2.562428e+05            7.000000
## 50%    2020.000000  2015.000000     3.711725e+05            9.000000
## 75%    2021.000000  2017.000000     6.134858e+05           11.250000
## max    2023.000000  2019.000000     8.400000e+06           51.000000
print(df.select_dtypes(include=numerics).iloc[:,5:10].describe())
##        college_grad_rate  ethnicity      gender  hs_grad_rate  median_income
## count         100.000000  93.000000  100.000000    100.000000     100.000000
## mean            0.312150   0.252462    0.510220      0.839330   49840.810000
## std             0.104223   0.213394    0.009574      0.065322   13395.317378
## min             0.118000   0.023000    0.478000      0.652000   26556.000000
## 25%             0.241000   0.080000    0.505000      0.800000   41997.000000
## 50%             0.296000   0.161000    0.510000      0.849500   47840.000000
## 75%             0.368250   0.378000    0.516000      0.880000   53763.750000
## max             0.657000   0.956000    0.532000      0.966000   99169.000000
print(df.select_dtypes(include=numerics).iloc[:,10:15].describe())
##        unemployment        budget  perc_asian  perc_hispanic  perc_white
## count     99.000000  1.000000e+02  100.000000          1.000  100.000000
## mean       0.067414  2.585186e+09    0.077110          0.274    0.526450
## std        0.023262  8.020317e+09    0.094522            NaN    0.185645
## min        0.034000  1.600000e+08    0.004000          0.274    0.034000
## 25%        0.047500  5.671675e+08    0.029000          0.274    0.440750
## 50%        0.063000  9.433000e+08    0.040500          0.274    0.535000
## 75%        0.079500  1.745000e+09    0.080500          0.274    0.666750
## max        0.145000  7.300000e+10    0.548000          0.274    0.893000
print(df.select_dtypes(include=numerics).iloc[:,15:18].describe())
##        perc_white_not_hispanic  perc_african_american  perc_native_american
## count                12.000000             100.000000            100.000000
## mean                  0.332750               0.207280              0.006190
## std                   0.129938               0.181885              0.010812
## min                   0.068000               0.000000              0.000000
## 25%                   0.264250               0.063000              0.000000
## 50%                   0.302500               0.152500              0.004000
## 75%                   0.408250               0.286250              0.008000
## max                   0.576000               0.827000              0.081000
print(df.select_dtypes(include=numerics).iloc[:,18:21].describe())
##        perc_pacific_islander  perc_two_or_more  mayor_gender_f_1
## count             100.000000        100.000000        100.000000
## mean                0.001070          0.036510          0.230000
## std                 0.008426          0.018996          0.422953
## min                 0.000000          0.000000          0.000000
## 25%                 0.000000          0.026000          0.000000
## 50%                 0.000000          0.034000          0.000000
## 75%                 0.000000          0.044500          0.000000
## max                 0.084000          0.165000          1.000000

Exploratory Analysis

First we create a new R dataframe called modeldf which we will use to build plots and build our logistic regression models. We also coerce our model’s dependent variables as factors. For the plots below, I choose to use an imported plot-styling function called bbc_style() that gives the plots a similar appearance to those published by the BBC.

df = py$df

modeldf = df[,c("2013_population","city_council_seats","college_grad_rate","term_ends","took_office"
                ,"ethnicity","hs_grad_rate","gov_type"
                ,"median_income","unemployment","budget"
                ,"perc_asian","perc_hispanic","perc_white","perc_white_not_hispanic"
                ,"perc_african_american","perc_native_american","perc_pacific_islander","perc_two_or_more"
                ,"mayor_gender_f_1","mayor_race_nonwhite_1","mayor_race","party","mayor_name","city_name"
                )]

modeldf$mayor_gender_f_1 = as.factor(modeldf$mayor_gender_f_1)
modeldf$mayor_race_nonwhite_1 = as.factor(modeldf$mayor_race_nonwhite_1)

##manually added 2019 estimate from here: https://www.bls.gov/web/metro/laulrgma.htm
modeldf$unemployment[modeldf$city_name=="Minneapolis, Minnesota"] = .036

First we plot the frequency count of mayors by type and by party affiliation. We see that the majority of mayors in the top 100 cities are Democrats. Surprisingly, female mayors have nearly the same ratio of Republicans as do white male mayors. There are 3 people of color who are Republican mayors, while no female mayors of color are Republican.

plotdf = modeldf
plotdf$type = ifelse(plotdf$mayor_race_nonwhite_1==1&plotdf$mayor_gender_f_1==1,"Female Person of Color",
                     ifelse(plotdf$mayor_race_nonwhite_1==1,"Person of Color",
                            ifelse(plotdf$mayor_gender_f_1==1,"Female","White Male")))
plotdf$party = ifelse(plotdf$party=="R","Republican",
                      ifelse(plotdf$party=="D","Democrat","Independent"))
totals = as.data.frame(table(plotdf$type))
plotdf = as.data.frame(table(plotdf$type,plotdf$party))
plotdf = plotdf[plotdf$Freq!=0,]

ggplot(data = plotdf, 
       aes(x = Var1,
           y = Freq,
           fill = Var2)) +
  geom_bar(stat = "identity") +
  bbc_style() +
  scale_fill_manual(values = c("#3358FF", "#8633FF","#C70039")) +
  geom_text(data = plotdf, aes(x = Var1, y = Freq, label = Freq),size = 3, position = position_stack(vjust = 0.5),fontface = "bold") +
  geom_text(data = totals, aes(x = Var1, y = Freq, label = Freq, fill = NULL),  size = 4, vjust = -0.4,fontface = "bold") +
  labs(title = "The majority of mayors in the US's largest cities are white men",
       subtitle = "Mayors in the 100 largest US cities by party affiliation") +
  theme(legend.position = "top", 
        legend.justification = "left") +
  guides(fill = guide_legend(reverse = TRUE)) 

We now compare the breakdown of the ethnicity of cities governed by female mayors versus those controlled by mayors of color. Here we see that there is a clear difference in the percentage of people who identify as African-American. Cities governed by female mayors are on average less African-American than those governed by mayors who are people of color. When we run logistic regression, we will investigate if this is true when controlling for other mayoral and city characteristics.

plotdf = modeldf

plotdf = aggregate(cbind(perc_asian,perc_white,
                         perc_white_not_hispanic,perc_african_american,
                         perc_native_american,perc_pacific_islander,
                         perc_two_or_more
) ~ mayor_gender_f_1 + mayor_race_nonwhite_1, data = plotdf, mean, na.rm = TRUE)

plotdf = plotdf[plotdf$mayor_gender_f_1==1|plotdf$mayor_race_nonwhite_1==1,]

plotdf$type = ifelse(plotdf$mayor_gender_f_1==1,"Female","Person of Color")

plotdf = melt(plotdf[,colnames(plotdf)[!(colnames(plotdf) %in% c("mayor_gender_f_1","mayor_race_nonwhite_1"))]],
              id=c("type"))

plotdf = plotdf[order(plotdf$value),]
plotdf$value = ifelse(plotdf$type=="Female",-plotdf$value,plotdf$value)

plotdf$variable = ifelse(plotdf$variable=="perc_white","White",
                  ifelse(plotdf$variable=="perc_white_not_hispanic","White Not Hispanic",
                  ifelse(plotdf$variable=="perc_african_american","African American",
                  ifelse(plotdf$variable=="perc_asian","Asian",
                  ifelse(plotdf$variable=="perc_two_or_more","Two or More Races",
                  ifelse(plotdf$variable=="perc_native_american","Native American","Pacific Islander"))))))

brks = seq(-1, 1, .5)
lbls = paste0(as.character(c(seq(100, 0, -50), seq(50, 100, 50))), "%")

ggplot(plotdf, aes(x = reorder(variable, abs(value)), y = value, fill = type)) +
  geom_bar(stat = "identity", width = .6) + 
  scale_y_continuous(breaks = brks,
                     labels = lbls) + 
  geom_text(aes(x=variable, y=value, 
                label=ifelse(variable%in%c("Native American","Pacific Islander"),"",paste(round(value*100,4),"%",sep = "")),
                hjust=.5),
            fontface = "bold") +
  coord_flip() +
  labs(title="Big cities with female mayors are less African-American",
       subtitle="Ethnicity in 100 largest American cities (% of population)") +
  bbc_style() +
  theme(legend.position = "top", 
        legend.justification = "left") +
  scale_fill_brewer(palette = "Dark2") 

We also see that female mayors and mayors who are people of color control smaller cities. However, female mayors control larger than average budgets, while people of color control fiscal budgets that are smaller than the average across all 100 cities.

plotdf = as.data.frame(
  cbind(
    rbind("Person of Color","Female","Female Person of Color","White Male"),
    rbind(
      mean(modeldf$`2013_population`[modeldf$mayor_race_nonwhite_1==1]),
      mean(modeldf$`2013_population`[modeldf$mayor_gender_f_1==1]),
      mean(modeldf$`2013_population`[modeldf$mayor_race_nonwhite_1==1&modeldf$mayor_gender_f_1==1]),
      mean(modeldf$`2013_population`[modeldf$mayor_race_nonwhite_1!=1&modeldf$mayor_gender_f_1!=1])),
    rbind(
      mean(modeldf$budget[modeldf$mayor_race_nonwhite_1==1]),
      mean(modeldf$budget[modeldf$mayor_gender_f_1==1]),
      mean(modeldf$budget[modeldf$mayor_race_nonwhite_1==1&modeldf$mayor_gender_f_1==1]),
      mean(modeldf$budget[modeldf$mayor_race_nonwhite_1!=1&modeldf$mayor_gender_f_1!=1]))
  )
)

colnames(plotdf) = c("type","population","budget")
plotdf$population = as.numeric(as.character(plotdf$population))
plotdf$budget = as.numeric(as.character(plotdf$budget))
plotdf$average_budget = mean(modeldf$budget)
plotdf$diff_budget = (plotdf$budget-plotdf$average_budget)/plotdf$average_budget
plotdf$type_budget = ifelse(plotdf$diff_budget<0,"Below Average", "Above Average")
plotdf$average_population = mean(modeldf$`2013_population`)
plotdf$diff_population = (plotdf$population-plotdf$average_population)/plotdf$average_population
plotdf$type_population = ifelse(plotdf$diff_population<0,"Below Average", "Above Average")

p1=ggplot(plotdf, aes(x=type, y=diff_budget, label=budget)) + 
  geom_bar(stat='identity', aes(fill=type_budget), width=.5)  +
  scale_fill_manual(name="Mileage", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("Above Average"="#339A00", "Below Average"="#CD5C5C")) + 
  scale_y_continuous(labels = scales::percent_format()) +
  geom_text(data = plotdf, 
            aes(x=type, y=diff_budget, label=paste(paste("$",round(budget/1000000000,2),sep = ""),"Billion")),
            size = 3, position = position_stack(vjust = 0.5),fontface = "bold") +
  labs(
    title = "Fiscal Budget",
    x = "Degree Above/Below Average"
    # title= "Female mayors control larger budgets",
    # subtitle="Average size of latest fiscal year budget (billions of dollars)'"
    ) + 
  bbc_style() +
  theme(legend.position = "top", 
        legend.justification = "top",
        plot.title = element_text(hjust = 0.5),
        axis.text=element_text(size=10,face = "bold")) +
  coord_flip()

p2=ggplot(plotdf, aes(x=type, y=diff_population, label=population)) + 
  geom_bar(stat='identity', aes(fill=type_population), width=.5)  +
  xlab("Degree Above/Below Average") +
  bbc_style() +
  scale_fill_manual(name="Mileage", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("Above Average"="#339A00", "Below Average"="#CD5C5C")) + 
  scale_y_continuous(labels = scales::percent_format()) +
  geom_text(data = plotdf, aes(x=type, y=diff_population, label=paste(round(population/1000,2),"Million")),
            size = 3, position = position_stack(vjust = 0.5), fontface = "bold") +
  labs(
    title = "2013 Population",
    x = "Degree Above/Below Average"
    # title="White men are mayors of larger US cities", 
    #    subtitle= "Average 2013 population (millions)"
    ) + 
  theme(legend.position = "top", 
        legend.justification = "top",
        plot.title = element_text(hjust = 0.5),
        axis.text=element_text(size=10,face = "bold")) +
  coord_flip()

multiplot(p1, p2, cols=2)

To investigate why these differences in population and the size of the recent fiscal budget exist, I plot individual cities on a graph where the population of the city is on the x-axis and the latest fiscal budget is on the y-axis. We see that there are a few notable outliers that may cause the differences in averages across the mayoral types. We see that Houston is a very populous city with a relatively small budget for its size, which may help drag down the average fiscal budget controlled by mayors of color. On the other hand, Phoenix appears to have an outsized fiscal budget for its population, which forces the average budget controlled by female mayors upward.

plotdf = modeldf
plotdf = plotdf[plotdf$mayor_gender_f_1==1|plotdf$mayor_race_nonwhite_1==1,]
plotdf$type = ifelse(plotdf$mayor_gender_f_1==1,"Female",ifelse(plotdf$mayor_race_nonwhite_1==1,"Person of Color","White Male"))

ggplot(plotdf, aes(x= `2013_population`, y= budget,
                     label=city_name,colour = factor(type))) +
  bbc_style() +
  geom_point() +
  geom_text(aes(label=city_name),hjust=0,vjust=0,
            check_overlap = TRUE) +
  scale_color_manual(values=c("#FAAB18", "#1380A1")) +
  labs(x = "Population", y = "Budget",colour = "Cluster",
       title="Phoenix and Houston are noticeable outliers",
       subtitle = "Population by Budget for 100 Largest Cities") +
  theme(legend.position = "top", 
        legend.justification = "top",
        #plot.title = element_text(hjust = 0.5),
        axis.text=element_text(size=10,face = "bold"))

In the next three plots we investigate the distributions of different city characteristics using box plots. Below we see that the median income for cities controlled by female mayors is higher than those controlled by people of color. The two outliers for the female mayor group are Fremont, California and Gilbert, Arizona, both of which have median incomes above $80K. The outlier for the person of color group is Plano, Texas, which also has a median income above $80K. It’s worth noting that the mayor here is Henry LaRosiliere, the only Black Republican mayor in the dataset.

medians = aggregate(median_income ~ type, data = plotdf, median)
ggplot(plotdf, aes(type, median_income)) +
  bbc_style() +
  geom_boxplot(varwidth=T, fill="plum") + 
  geom_text(data = medians, 
            aes(x = type, y = median_income, label = paste0("$",median_income)),
            vjust=-.35) +
  labs(title="Female Mayors Govern Cities with Higher Incomes", 
       subtitle="Distribution of Median Income by Mayor Type")

We also see that the college graduation rate is lower in cities governed by mayors who are people of color. The median college graduation rate for cities controlled by women is 32.7%, whereas it is 24.6% in cities controlled by mayors who are people of color.

medians = aggregate(college_grad_rate ~ type, data = plotdf, median)
ggplot(plotdf, aes(type, college_grad_rate)) +
  bbc_style() +
  geom_boxplot(varwidth=T, fill="#FAAB18") + 
  geom_text(data = medians, 
            aes(x = type, y = college_grad_rate, label = paste0(college_grad_rate*100,"%")),
            vjust=-.35) +
  labs(title="Mayors of Color Govern Cities with Lower College Grad Rates", 
       subtitle="Distribution of College Grad Rates by Mayor Type")

The plot belows creates a boxplot by mayor type for the unemployment rates across the cities in the dataset. We see that unemployment rates are roughly the same between mayoral types.

medians = aggregate(unemployment ~ type, data = plotdf, median)
ggplot(plotdf, aes(type, unemployment, label = unemployment),vjust=-.35) +
  bbc_style() +
  geom_boxplot(varwidth=T, fill="#1380A1") + 
  geom_text(data = medians, 
            aes(x = type, y = unemployment, label =  paste0(unemployment*100,"%")),
            vjust=-.35) +
  labs(title="Unemployment Rates are Similar Across Mayoral Types", 
       subtitle="Distribution of Unemployment Rates by Mayor Type")

Model Building

The plots above are good for understanding the correlations of various city and mayoral features by mayoral type. However, to understand how all of these factors interact with each other and influence the outcome of female or person of color for each city’s mayor, we need to build a model. We will build separate regression models for these two outcomes. Logistic regression lends itself well to this use case because of its quick computation time, it’s applicability to small datasets, and it’s easy interpretability.

To aid further in interpretability, I build a function called generate_plot below. When we retrieve our covariate estimates, we will use the generate_plot function to hold all other factors used in the model constant, and only vary a specific variable. This will allow us to see how the probability of the outcome female or person_of_color changes when a single variable changes, holding the other model features constant.

generate_plot = function(chosen_var){
  
  vars = names(coef(summary(backwards))[,1])
  vars = vars[2:length(vars)]
  vars = gsub("`","",vars)
  
  vars_constant = vars[!(vars %in% chosen_var)]
  newdata = head(modeldf[,vars],10)
  
  for(i in 1:length(vars_constant)){
    newdata[,vars_constant[i]] = rep(mean(modeldf[,vars_constant[i]]),10)
  }
  
  newdata[,chosen_var] = seq(min(modeldf[,chosen_var]),max(modeldf[,chosen_var]),length.out = 10)
  
  newdata$pred = predict(backwards,newdata = newdata,type = "response")
  newdata$pred_diff = c(0,diff(newdata$pred,lag = 1))
  print(paste("Mean increase/decrease in probability for variable",chosen_var,":",mean(newdata$pred_diff)))
  
  return(
    ggplot(newdata, aes(x = newdata[,chosen_var], y = pred)) +
      geom_point() +
      labs(x = chosen_var, y = "estimated_probability")
  )
}

Female Mayor

Below we initialize a model equation with all of the variables we think may impact our mayor_gender_f_1 dependent variable. Using the glm function, we create a logistic regression model object, and then use the step function to perform backwards selection.

In the past, I found backwards selection to be a quick but rigorous way to find a parsimonious model for regression methods. Backwards selection is known as a wrapper method for reducing the number of predictors in a model. Wrapper methods evaluate multiple models using approaches that add or remove predictors to find combinations of features that optimize model performance. In backward selection, the initial model contains all predictors (see the model_equation specification below) which are iteratively removed to determine which do not have a significant contribution to the model’s performance.

The below step function uses the AIC statistic to evaluate model performance. The AIC statistic is essentially a penalized version of the tradition sum of squares error metric, and is useful in our case because it applies a penalty for the number of terms in the model, thereby ensuring a simpler model. The function below will print out each stage of the variable paring-down process, and will create an object we call backwards which will have the final pared-down model object.

model_vars = c("median_income","perc_african_american","perc_white","unemployment","2013_population",
               "budget","hs_grad_rate","college_grad_rate","mayor_gender_f_1","mayor_race_nonwhite_1")

model_equation = as.formula(paste("mayor_gender_f_1~perc_african_american+`2013_population`+hs_grad_rate+median_income+budget+college_grad_rate"))

logistic_model = glm(model_equation,
                     family = binomial(link = "logit") ,
                     data = modeldf[,model_vars])

#backwards selection is the default in this function
backwards = step(logistic_model)
## Start:  AIC=106.95
## mayor_gender_f_1 ~ perc_african_american + `2013_population` + 
##     hs_grad_rate + median_income + budget + college_grad_rate
## 
##                         Df Deviance    AIC
## - hs_grad_rate           1   93.221 105.22
## - college_grad_rate      1   93.436 105.44
## <none>                       92.953 106.95
## - median_income          1   96.317 108.32
## - perc_african_american  1   97.800 109.80
## - budget                 1   99.013 111.01
## - `2013_population`      1   99.990 111.99
## 
## Step:  AIC=105.22
## mayor_gender_f_1 ~ perc_african_american + `2013_population` + 
##     median_income + budget + college_grad_rate
## 
##                         Df Deviance    AIC
## - college_grad_rate      1   93.452 103.45
## <none>                       93.221 105.22
## - median_income          1   96.386 106.39
## - perc_african_american  1   97.969 107.97
## - budget                 1   99.300 109.30
## - `2013_population`      1  100.066 110.07
## 
## Step:  AIC=103.45
## mayor_gender_f_1 ~ perc_african_american + `2013_population` + 
##     median_income + budget
## 
##                         Df Deviance    AIC
## <none>                       93.452 103.45
## - perc_african_american  1   98.669 106.67
## - budget                 1   99.636 107.64
## - median_income          1   99.730 107.73
## - `2013_population`      1  100.403 108.40

Below we print out the formula for the model that the backwards selection process chose that minimized the AIC statistic, as well as the final AIC metric.

print(formula(backwards))
## mayor_gender_f_1 ~ perc_african_american + `2013_population` + 
##     median_income + budget
paste("Model AIC:",summary(backwards)$aic)
## [1] "Model AIC: 103.452257655713"

We then print the variable coefficient estimates below. The final model contains four variables that minimize the AIC statistic. As the above plots suggested, as the fiscal budget and median income in a city increase, so does the likelihood that the mayor of a city is female. Similarly, as the city population increases, the likelihood that the mayor of a city is female decreases. Surprisingly, however, we find that when controlling for other factors in the model, as the percentage of a city that identifies itself as African-American increases, so does the likelihood that a mayor will be female. This is a different conclusion from the one we drew earlier when we looked at the average rates of people in cities that identify as African-American across mayor types.

summary(backwards)
## 
## Call:
## glm(formula = mayor_gender_f_1 ~ perc_african_american + `2013_population` + 
##     median_income + budget, family = binomial(link = "logit"), 
##     data = modeldf[, model_vars])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3919  -0.7000  -0.5216  -0.1888   2.0798  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -4.236e+00  1.517e+00  -2.792  0.00524 **
## perc_african_american  3.820e+00  1.696e+00   2.253  0.02428 * 
## `2013_population`     -2.077e-06  1.111e-06  -1.869  0.06166 . 
## median_income          5.433e-05  2.251e-05   2.413  0.01580 * 
## budget                 2.140e-10  1.223e-10   1.750  0.08006 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107.855  on 99  degrees of freedom
## Residual deviance:  93.452  on 95  degrees of freedom
## AIC: 103.45
## 
## Number of Fisher Scoring iterations: 6

The first of the three graphs below bears this out. Holding all else constant, a city’s percentage of people who identify as African-American is actually a significant predictor of the likelihood that a city’s mayor is female. I also print similar graphs for a city’s fiscal budget and population.

generate_plot("perc_african_american")
## [1] "Mean increase/decrease in probability for variable perc_african_american : 0.0616976686424508"

generate_plot("budget")
## [1] "Mean increase/decrease in probability for variable budget : 0.0878881571300609"

generate_plot("2013_population")
## [1] "Mean increase/decrease in probability for variable 2013_population : -0.0351105066683875"

Mayor of Color

We perform the same backwards selection process for the model with mayor_race_nonwhite_1 as the target variable. Again, we initialize the model_equation with all variables, and pare down the model with the intent of creating a simpler model at the end of the process.

set.seed(88) 

model_vars = c("median_income","perc_african_american","perc_white","unemployment","2013_population",
               "budget","hs_grad_rate","college_grad_rate","mayor_gender_f_1","mayor_race_nonwhite_1")

model_equation = as.formula(paste("mayor_race_nonwhite_1~perc_african_american+`2013_population`+hs_grad_rate+median_income+budget+college_grad_rate"))

logistic_model = glm(model_equation,
                     family = binomial(link = "logit") ,
                     data = modeldf[,model_vars])

backwards = step(logistic_model)
## Start:  AIC=93.88
## mayor_race_nonwhite_1 ~ perc_african_american + `2013_population` + 
##     hs_grad_rate + median_income + budget + college_grad_rate
## 
##                         Df Deviance     AIC
## - budget                 1   79.929  91.929
## - `2013_population`      1   81.803  93.803
## <none>                       79.879  93.879
## - median_income          1   84.349  96.349
## - college_grad_rate      1   85.229  97.229
## - hs_grad_rate           1   93.699 105.699
## - perc_african_american  1   94.532 106.532
## 
## Step:  AIC=91.93
## mayor_race_nonwhite_1 ~ perc_african_american + `2013_population` + 
##     hs_grad_rate + median_income + college_grad_rate
## 
##                         Df Deviance     AIC
## <none>                       79.929  91.929
## - `2013_population`      1   83.225  93.225
## - median_income          1   84.384  94.384
## - college_grad_rate      1   85.413  95.413
## - hs_grad_rate           1   93.811 103.811
## - perc_african_american  1   94.532 104.532

We print the summary of the covariate estimates below, which includes the pared-down model specification as well as the final AIC statistic. Unsurprisingly, we see that the percentage of a city that identifies as African American is a strong indicator of the city’s mayor being a person of color. We also find that the high school graduation rate of a city is also an indicator of the city’s mayor being a person of color. Holding all else constant, as the high school graduation rate of a city decreases, the likelihood that a city’s mayor is a person of color increases.

summary(backwards)
## 
## Call:
## glm(formula = mayor_race_nonwhite_1 ~ perc_african_american + 
##     `2013_population` + hs_grad_rate + median_income + college_grad_rate, 
##     family = binomial(link = "logit"), data = modeldf[, model_vars])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5027  -0.6299  -0.3315  -0.1534   2.1417  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.215e+01  4.893e+00   2.484 0.012995 *  
## perc_african_american  6.937e+00  2.008e+00   3.454 0.000553 ***
## `2013_population`     -9.452e-07  6.661e-07  -1.419 0.155891    
## hs_grad_rate          -2.523e+01  7.517e+00  -3.356 0.000790 ***
## median_income          6.533e-05  3.170e-05   2.061 0.039316 *  
## college_grad_rate      9.875e+00  4.431e+00   2.229 0.025837 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 105.382  on 99  degrees of freedom
## Residual deviance:  79.929  on 94  degrees of freedom
## AIC: 91.929
## 
## Number of Fisher Scoring iterations: 6

We generate the plots below for perc_african_american and hs_grad_rate to see how these variables impact the predicted probability, holding all other variables at their average levels.

generate_plot("perc_african_american")
## [1] "Mean increase/decrease in probability for variable perc_african_american : 0.0885319076493798"

generate_plot("hs_grad_rate")
## [1] "Mean increase/decrease in probability for variable hs_grad_rate : -0.0941203554262382"

It is evident that there are differences between the cities and that these differences impact mayoral outcomes. However, we must keep in mind that the dataset we analyzed is small, and that a few mayoral outcomes could change the parameters of the models and the conclusions of our graphs above. To improve on this analysis, we could scrape together historical information on mayoral outcomes for these cities, and run the models on a larger dataset. We could also scrape more cities outside of the 100 largest ones, and thereby account for selection bias in our dataset. It could very well be the case that mayoral outcomes in the larger cities in the United States may differ dramatically from smaller cities, and that larger cities’s likelihood of electing a female or person of color differs from the rest of the United States’ likelihood of making these same electoral choices. Whatever the case, Chicago is set to make history, and other cities may follow suit in electing mayors of different ethnicities and genders. Once the Chicago mayoral election in April and other mayoral races in big cities conclude, the scraper and these models may need to be rerun to account for new changes in the United States’s political landscape.