Bay Area Geodemographics
Tracking change in San Francisco Bay Area Neighborhoods

Hierarchical clustering, complete linkage, 15 clusters, temporal, excluding tracts with fewer than 1,000 inhabitants.


The social landscape of the Bay Area is changing very rapidly. The debate surrounding gentrification in the Bay Area shows how contentious the issue of neighborhood change is. For a long time, the dichotomy between urban and suburban areas was relevant in describing US neighborhoods. City centers were decaying while suburban areas were getting wealthier and, often, whiter. The “white flight” is now over and, today, city centers are becoming more attractive while suburban areas are becoming more diverse.


Census data help us understand neighborhood change one issue at a time: race, income and, built on top of both, gentrification. Geodemographics, by contrast, combines a set of variables to uncover similar types of neighborhoods and patterns of common processes. It uses statistical methods to group similar neighborhoods into clusters in a holistic way. Identifying the variables that most influence the clustering process makes it possible to describe each type of neighborhood.

Geodemographics has been widely used as a geomarketing tool for targeting neighborhoods for commercial purposes (see Harris, Sleight and Webber, 2005). The efforts by Alex Singleton to create a set of open geodemographics have brought these methods to the scientific field. Applications in London and the United Kingdom are very well documented. Alex Singleton and Seth Spielman have also applied geodemographics to the US and demonstrated the viability of using uncertain ACS data (American Community Survey, based on samples that sometimes lead to large margins of error) to produce a reliable neighborhood classification.

Tracking Neighborhood Change

This project aims to build on previous work in geodemographics in order to track neighborhood change in the Bay Area and to present the results in an interactive way. The clustering method (described below) has been applied to the same set of variables over several years (from 2000 to 2014) to detect change at the Census tract level.

The names used to characterize the clusters are meant to be as neutral as possible but necessarily suffer from the “ecological fallacy” pointed out by Vickers et al. The inherently geographical dilemma of choosing the right scale to sum up the features of an area (not too big, in order to detect a pattern; not too small, in order to catch a neighborhood effect and to have reliable data) necessarily reduces the complexity of a neighborhood. Labels must be thought of as catchy snapshots that are irrelevant without their more detailed description and their respective visual ‘portraits’ (see graphs).


The set of selected variables is organized in six categories covering different aspects of social life: education, family types, population characteristics, type of housing, built environment, and economic features.

Out of these raw variables, a bottom-up algorithm is used to create clusters of similar tracts. The algorithm is called hierarchical clustering, applied here with complete linkage on a Euclidean distance matrix. It calculates the distance between every pair of tracts in a multidimensional space, with one dimension per variable. Each tract is its own cluster in the beginning. The first cycle merges the two closest tracts into a new cluster. The second cycle merges the two closest clusters, and so forth until there is only one cluster left. Under complete linkage, the distance between two clusters is the distance between their two most dissimilar members. The dendrogram below graphically displays how each tract gradually agglomerates with its most similar counterpart. The dendrogram is a very convenient tool for deciding the number of clusters that produces a meaningful “cut” in the clustering process. The dendrogram here is colored according to the 15 clusters depicted on the map above.
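The agglomeration cycles described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for this project: the function name `complete_linkage`, the six toy "tracts" and their two variables are hypothetical, and the variables are assumed to be already standardized. For real data, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="complete"` would be a more practical choice.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def complete_linkage(points, n_clusters):
    """Merge points bottom-up until n_clusters remain, using complete linkage."""
    # Each observation starts as its own cluster (a list of row indices).
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster b into cluster a (a < b, so index a stays valid after pop).
        clusters[a] = clusters[a] + clusters.pop(b)
    return [sorted(c) for c in clusters]


# Hypothetical standardized values for six tracts on two variables.
tracts = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0), (9.1, 0.3)]
groups = complete_linkage(tracts, 3)
```

Stopping the loop at a chosen number of clusters plays the same role as cutting the dendrogram at a given height. The sketch recomputes every distance at each cycle, which is fine for a handful of points but far too slow for all Bay Area tracts.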

The decision to cut the clustering at 15 relies on the observation of the dendrogram (the clusters are by and large consistent in size), and also on my empirical understanding of the Bay Area and that of my peers.

To help understand how tracts that fall within the same cluster are similar, I plotted on a radar chart the distances to the Bay Area mean of each raw variable. The labels and portraits of each cluster have been derived from these visual displays of every category's features. Please note that the reference space for the clustering process is the Bay Area (hence the Bay Area mean). Had the same analysis been run on the whole country (same Census tract level, same variables), the output clusters would have been different (see Spielman and Singleton, 2015). I allowed myself to use fewer variables than the number recommended by Spielman and Singleton for three reasons: the Bay Area is densely populated, I used mostly cumulative variables that mitigate the margins of error, and I used no variable subject to very large margins of error (such as the Native American population).
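The values behind such a radar chart can be sketched as follows. This is a hedged illustration rather than the project's actual code: it assumes that the plotted "distance" is the signed difference between a cluster's mean and the overall mean for each variable, and the function name `cluster_profiles`, the four toy tracts, the two variables and the labels "A" and "B" are all hypothetical.

```python
from statistics import mean


def cluster_profiles(rows, labels):
    """Return, per cluster, the cluster mean minus the overall mean of each variable."""
    n_vars = len(rows[0])
    # Mean of each variable over all tracts (the reference space).
    overall = [mean(row[k] for row in rows) for k in range(n_vars)]
    profiles = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        # Signed deviation of the cluster mean from the overall mean (assumption).
        profiles[label] = [
            mean(row[k] for row in members) - overall[k] for k in range(n_vars)
        ]
    return profiles


# Hypothetical data: four tracts described by two variables.
rows = [(0.2, 0.4), (0.4, 0.6), (0.8, 0.1), (0.6, 0.3)]
labels = ["A", "A", "B", "B"]
profiles = cluster_profiles(rows, labels)
```

Each list in `profiles` holds one value per variable, that is, one value per axis of the radar chart. Because the reference is the mean of the tracts supplied, running the same function on a different set of tracts (the whole country, for instance) would shift every profile.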