I’ve been writing an article in which I use a one-dimensional kernel density estimation (KDE). After some thought (and peer review ;-P ) I decided, I needed to visualise how it works.  I couldn’t find any R-code on how to do this online, soooo here it is: My R-code on how to produce a graph which may help explaining KDEs.

We will need ggplot2, because I like ggplot. All the other functions needed belong to {base} R. The plan is the following:

  1. Create an example vector.
  2. Create a normal distribution.
  3. In a loop for each value of the example vector:
    1. Move the normal distribution so that the mean is the i’th value of the example vector (along the x-axis).
    2. Derive a density distribution from the normal distribution and divide it by the length of the example vector. This is necessary, otherwise the density of this distribution will always be higher than of the KDE in the end.
    3. Fill a third column in the data frame with the value of the example vector to create a category for later seperation.
    4. Continuously add the new rows to a data frame
  4. Make a plot, where the example vector gets visualised by points as well as the KDE and plot the extra density distributions calculated in the loop as a dashed line. I also added a grey dotted line to show that the mean of these density distributions is right on the top of the example vector’s values.

So, let’s get started!

# First I create a vector with some values to use as an example.

v <- c(1.2, 4.7, 4.0, 4.8, 4.3, 6.1, 6.7, 5.9, 6.5, 1.7, 0)

# Now I need a normal distribution. 
z <- rnorm(n = 1000, mean = 0, sd = 1)
# I'll need an empty dataframe for the loop
df <- NA

for (i in v)
# Move the distribution along the x-axis
  x <- z + i 
# Create a density distribution divided by the number of data points in the example vector. 
# Chose sd to be the same as bw for kde of example vector in the last plot. 
  y <- dnorm(z, sd = 1)/length(v)
# Make category entries for later division: needs to be repeated 1000 times so it's a vector of the same length as x and y
  i <- rep(as.factor(i), 1000)
# Bind all three vectors to the dataframe and add to old one
  df <- rbind(df, data.frame(x,y,i)) 
# delete the first empty row of the data frame 
df <- df[1,]

Now, let’s plot this!


  geom_density(aes(x = v), bw = 1)+ # the KDE of the example vector
  geom_point(aes(x = v, y = 0), size = 2)+  # point visualisation of this vector
  geom_line(aes(x = df$x, y = df$y, group = df$i),linetype = "dashed" )+ # density dist above these points
  geom_segment(aes(x = v, y = 0, xend = v, yend = max(df$y, na.rm = TRUE)), # line to peak of density dist
               color = "grey",
scale_x_continuous(breaks =0:8, # better scale
  labels = 0:8,
  name = "distance in m")+
  scale_y_continuous(name = "density")+

ggsave("this_is_how_KDE_works.jpg", dpi = 300, width = 20, height = 15, units = "cm")

VoilĂ !

What do you think?

Is there a faster/better/nicer/more elegant way?

… Probably. But it works. 😉

About the Author

My name is Sophie, I am a prehistoric archaeologist and have been research associate at the University of Bonn and the Cologne Digital Archaeology Lab (CoDArchLab) of the Archaeological Institute at the University of Cologne, Germany. I teach statistics for archaeologists, work on new methods in settlement archaeology (GIS, geostatistics in R and stuff) and am interested in archaeogaming.

View Articles