R package 'seg'

The seg package is a collection of segregation measures developed in the geographical and sociological literature. Implemented measures range from simple, traditional methods like the index of dissimilarity (1955) to more sophisticated, spatial approaches such as Reardon and O'Sullivan's spatial segregation measures (2004). If you think I have missed any important methods in the field, or if you have any suggestions or new ideas to make existing features better, please do not hesitate to email me at hong.seongyun [at] gmail.com. Reported bugs will be fixed as soon as possible. All other suggestions will be considered when preparing for the next update. Please note however that I work on this package mainly in my spare time, so updates may not be as frequent as I would like. My aim is to have at least two updates every year (one in May and the other in November).

This page demonstrates the use of dissim(), isp() and spseg() for computing Duncan & Duncan's D, Morrill's D(adj), Wong's D(w) and D(s), White's SP, and Reardon and O'Sullivan's spatial segregation measures. The latest version of seg is available to download from CRAN: http://cran.r-project.org/web/packages/seg/

For version history and instructions on other methods:


1. Hypothetical patterns of segregation

The seg package contains a sample data set of eight different distributions of the population for testing purpose. The data set itself is a simple data frame but it can be displayed on a 10-by-10 grid to reproduce the hypothetical segregation patterns used by Morrill (1991) and Wong (1993).

library(seg)
# Load the sample data set into the current workspace
data(segdata)

# Create a 10-by-10 grid to display the data set
grd <- GridTopology(cellcentre.offset=c(0.5,0.5),
                    cellsize=c(1,1), cells.dim=c(10,10))
grd.sp <- as.SpatialPolygons.GridTopology(grd)

# Display the eight different patterns on one page
par(mfrow = c(2, 4), mar = c(0, 1, 0, 1))
for (i in 1:8) {
  idx <- 2 * i
  full <- segdata[,(idx-1)] == 100
  half <- segdata[,(idx-1)] == 50
  plot(grd.sp)
  plot(grd.sp[full,], col = "Black", add = TRUE)
  if (any(half))
    plot(grd.sp[half,], col = "Grey", add = TRUE)
  text(5, 11.5, labels = paste("segdata[,", idx-1, ":", idx, "]", sep = ""))
}

The black cells are where the minority population comprises 100% of the local population, the grey cells 50%, and the white cells 0%. You can download additional, aggregated grid data sets used by Wong (1993) from the below:


2. Index of dissimilarity, D, and its spatial counterparts

The index of dissimilarity, D, and its spatial counterparts, D(adj), D(w), and D(s) can be calculated in two ways. One is by using the function seg(), and the other is by using the function dissim(). For D, which is insensitive to the spatial arrangements of spatial units (e.g., census tracts), there are practically no differences between these two functions. Both functions require a data frame with two columns, each of which represents a population group, as an input.

seg(data = segdata[,1:2])
# [1] 1
dissim(data = segdata[,1:2])
# $d
# [1] 1
# $dm
# [1] NA
# $dw
# [1] NA
# $ds
# [1] NA
# $user
# [1] NA

The difference arises when we need to calculate its spatial associates proposed by Morrill (1991) and Wong (1993). For example, if you want to adjust D as suggested by Morrill (1991), you should first construct a rook-based, binary contiguity matrix for the study region (where 1 indicates "connected" and 0 "not connected) and standardise it by the total number of neighbours. The function seg() asks you to do this. In R, you can use poly2nb() and nb2mat() in package spdep to create such matrices.

library(spdep)

# Create a rook-based (queen = FALSE), binary (style = "B") contiguity matrix
grd.nb <- nb2mat(poly2nb(grd.sp, queen = FALSE), style = "B")

# Standardise the matrix
grd.nb <- grd.nb / sum(grd.nb)

# Call seg() with an optional argument 'nb'
seg(data = segdata[,1:2], nb = grd.nb)
# [1] 0.9444444

This is not difficult; it requires only a couple of lines. However, this can be annoying if you have to repeat it many times. The function dissim() is a shortcut function that runs the above code for you (provided that all the necessary libraries are already installed on your machine). If you have the spdep package in R, for example, you can get something like this:

dissim(x = grd.sp, data = segdata[,1:2], adjust = TRUE, p2n.args = list(queen = FALSE), n2m.args = list(style = "B"))
# $d
# [1] 1
#
# $dm
# [1] 0.9444444
#
# $dw
# [1] NA
#
# $ds
# [1] NA
#
# $user
# [1] NA

NAs indicate that the required libraries are not available on your machine. For Wong's D(w) and D(s), we need two additional R packages, rgdal and spgrass6, and GRASS GIS, an open-source GIS software.

D(w) is essentially the same as D(adj), in the sense that it adjusts D to reflect the spatial arrangements of areal units. If two adjacent census tracts have very different population compositions, say, 100% Dwarves in one unit and 100% Hobbits in the other, it will decrease the D value slightly. If most of the neighbouring pairs in your data have completely different demographic structures, as shown in the checkerboard-like pattern above (i.e., segdata[,5:6]), the original D value will be substantially reduced.

So why do we need extra tools for calculating D(w) (and D(s) as you will see later)? The difference between D(adj) and D(w) is that the former treats all pairs of neighbours equally while the latter considers those with longer shared boundaries more important. As such, we should know the shared boundary lengths between areal units to calculate D(w).

In R, this is do-able but is not an easy task. See https://stat.ethz.ch/pipermail/r-sig-geo/2005-October/000611.html and https://stat.ethz.ch/pipermail/r-sig-geo/2005-October/000617.html. We use GRASS GIS and the additional libraries to get this geometrical information. With seg(), you need to go through several steps to get D(w):

  1. Use vect2neigh() in package spgrass6 to get the shared boundary lengths
  2. Put them into a matrix
  3. Divide the values in the matrix by the total shared boundary length
  4. Use the standardised matrix as 'nb'
library(spgrass6); library(rgdal)

# Run GRASS in the current R session. Change the path (in red) to where you
# installed GRASS GIS. Don't need this if you started R within GRASS
initGRASS("C:/Program Files (x86)/GRASS 6.4.1", home = tempdir())

rownames(segdata) <- paste("g", 1:100, sep = "")
grd.df <- SpatialPolygonsDataFrame(grd.sp, segdata)
writeVECT6(grd.df, vname = "grdGRASS", v.in.ogr_flags = "o")

# (1) Get the shared boundary lengths
sl <- vect2neigh(vname = "grdGRASS", units = "me")

# (2) Put them into a matrix
sl.mat <- listw2mat(sn2listw(sl))

# (3) Globally standardise the matrix
sl.mat <- sl.mat / sum(sl.mat)

# (4) Calculate D(w)
seg(segdata[,1:2], nb = sl.mat)

D(s) uses the perimeter/area ratio of the units for the adjustment. The areas of the units are already stored in your SpatialPolygons object. The perimeters can be obtained using vect2neigh(). Note that the boundary of the study region is not counted towards the calculation of the perimeters. See p. 566 in Wong (1993) for an explanation.

A <- unlist(lapply(slot(grd.sp, "polygons"), function(z) slot(z, "area")))
P <- attr(sl, "total") - attr(sl, "external")
PAR <- P/A
maxPAR <- max(PAR)

PAR.mat <- matrix(NA, nrow = length(PAR), ncol = length(PAR))
for (i in 1:length(PAR)) {
  for (j in 1:length(PAR))
    PAR.mat[i,j] <- ((PAR[i] + PAR[j])/2) / maxPAR
}

seg(segdata[,1:2], nb = sl.mat * PAR.mat)

As mentioned above, the function dissim() is a wrapper function that calls seg() with the extra code above, so the adjustment is made automatically. All you have to do is to make sure that you have the necessary libraries installed.

library(spgrass6); library(rgdal)
initGRASS("C:/Program Files (x86)/GRASS 6.4.1", home = tempdir())

dissim(x = grd.sp, data = segdata[,1:2], adjust = TRUE, p2n.args = list(queen = FALSE), n2m.args = list(style = "B"))
# $d
# [1] 1
#
# $dm
# [1] 0.9444444
#
# $dw
# [1] 0.9444444
#
# $ds
# [1] 0.9472222
#
# $user
# [1] NA

The table below compares the segregation values in the original article to the ones from the function dissim().

  Package 'seg' Morrill (1991) Wong (1993)
D D(adj) D(w) D(s) D D(adj) D D(adj) D(w) D(s)
1.00 0.94 0.94 0.95 1.00 0.94 1.00 0.94 0.94 0.95
1.00 0.83 0.83 0.84 1.00 0.83 1.00 0.83 0.83 0.84
1.00 0.501 0.50 0.54 1.00 0.48 1.00 0.50 0.50 0.54
0.84 0.791 0.79 0.79 0.83 0.76 NA NA NA NA
0.83 0.66 0.66 0.68 0.83 0.66 NA NA NA NA
1.00 0.97 0.97 0.97 NA NA 1.00 0.97 0.97 0.97
1.00 0.93 0.93 0.93 NA NA 1.00 0.93 0.93 0.93
1.00 0.90 0.90 0.91 NA NA 1.00 0.90 0.90 0.91
1.00 0.50 0.50 0.50 NA NA 1.00 0.50 0.50 0.50
1.00 0.33 0.24 0.54 NA NA 1.00 0.33 0.24 0.54
1.00 0.50 0.70 0.74 NA NA 1.00 0.50 0.70 0.74
1.00 0.572 0.54 0.68 NA NA 1.00 0.50 0.54 0.68
1.00 0.402 0.36 0.613 NA NA 1.00 0.33 0.36 0.57

Small differences (i.e., less than or equal to 0.01) may be due to rounding. The differences greater than 0.01 are highlighted in red and numbered with superscript. The reasons for the discrepancy are as follows:

  1. Rounding problem, maybe? For the third pattern, the result from seg() is consistent with the one in Wong (1993).

  2. This is because seg() (and dissim() as well) uses a different way of counting the neighbouring pairs from Wong (1993). In Wong (1993), the pattern below has eight pairs of neighbours: [1-2], [1-3], [1-5], [2-4], [2-5 through Edge A], [2-5 through Edge B], [3-4], and [4-5]. In four of them, the units have the same population composition (i.e., both units have 100% Dwarves). The other four pairs (i.e., [1-5], [2-5 through Edge A], [2-5 through Edge B], and [4-5]) consist of one unit with 100% Dwarves and the other unit with 100% Hobbits. So, deduct 4/8 = 0.5 from the original D value, 1, and we have 1 - 0.5 = 0.5.

    In seg(), there are only seven pairs of neighbours, not eight: [1-2], [1-3], [1-5], [2-4], [2-5], [3-4], [4-5]. Of the seven pairs, three have one unit with 100% Hobbits and the other with no Hobbits (i.e., [1-5], [2-5], and [4-5]). So D(adj) = 1 - 3/7 = 0.5714.

  3. One possible explanation is that the lengths of shared boarders between census tracts were rounded to the first decimal place during the calculation of D(s) in Wong (1993). If we calculate the index in this way, seg() generates the same result as Wong (1993), up to the second decimal place.


3. Index of spatial proximity

The traditional index of dissimilarity and its spatial associates above consider only two population groups at a time. Considering that many societies are increasingly diverse in terms of race, ethnicity, culture, and religion, this limitation is not desirable. One of the classical measures that can work with multiple groups is the index of spatial proximity, SP, developed by White (1983).

In the seg package, the function to compute this measure of clustering is called isp(). The input for this function must be a spatial object, whether it is a SpatialPolygons object or a matrix with coordinates, because SP does not have an aspatial analog. Another difference between dissim() and isp() is that the data table (i.e., population counts) for isp() does not have to be a matrix with only two columns; it accepts a numeric matrix with more than two columns, as SP can handle multiple population groups.

By default, a simple negative exponential function is used to control how the distance affects the social interactions between people, but different distance decay models can be specified through an optional argument if necessary.


4. Reardon and O'Sullivan's spatial segregation measures

To demonstrate the use of spseg(), I use the 2006 New Zealand Census. I downloaded the ethnic population data by census area units and digital boundaries in shapefile format for 16 city councils1 from http://www.stats.govt.nz/, joined the tables to the shapefiles, and saved the files as a single R object. You can download the file from here and read it into R.

# Assuming that the downloaded file is stored in your working directory
load("nzcity.rda")

It has one 'list' object with 16 components, each of which contains a 'SpatialPolygonsDataFrame' object for one city council. Let's explore the data a little.

names(nzcity)
# [1] "AKC" "CHC" "DUN" "HMT" "INV" "LHT" "MKC" "NAP" "NEL" "NSC" "PAL" "POR" "TAU" "UHT"
# [15] "WAI" "WEL"
nzcity$NSC
# class       : SpatialPolygonsDataFrame
# nfeatures   : 53 
# extent      : 2656261, 2672202, 6483833, 6503050  (xmin, xmax, ymin, ymax)
# coord. ref. : +prj=nzmg 
# nvariables  : 7
# names       :        AU, EUROPEAN, MAORI, PACIFIC, ASIAN, MELAA, OTHER 
# min values  :    Albany,      795,    36,       3,    45,     3,    78 
# max values  : Witheford,     4842,   960,     639,  1944,   438,   768
data.frame(nzcity$NSC)
plot(nzcity$NSC)

The abbreviations used are as follows: AKC - Auckland, CHC - Christchurch, DUN - Dunedin, HMT - Hamilton, INV - Invercargill, LHT - Lower Hutt, MKC - Manukau, NAP - Napier, NEL - Nelson, NSC - North Shore, PAL - Palmerston North, POR - Porirua, TAU - Tauranga, UHT - Upper Hutt, WAI - Waitakere, WEL - Wellington.



1 There are only 13 city councils left in New Zealand, as four city councils, namely, Auckland, Manukau, North Shore, and Waitakere, together with three other district councils, were merged into one super city council "Auckland" in 2010.


References

White MJ. 1983. The measurement of spatial segregation. The American Journal of Sociology 88: 1008-1018.

Morrill RL. 1991. On the measure of geographic segregation. Geography Research Forum 11: 25-36.

Wong DWS. 1993. Spatial indices of segregation. Urban Studies 30: 559-572.