Simon Maddison and I published a paper on a clustering method called “percolation analysis” and, I am proud to say, we put the algorithms in an R package and published that as well! Let me tell you a bit about what we did…
Some time ago, Simon and I were the two lone people working on prehistoric settlement point pattern analysis in a big session called “Cities of Data: computational and quantitative advances to urban research” at the international CAA conference in Tübingen. Yeah, maybe “urban” should have warned us that people would be talking about Space Syntax, 3D GIS based analysis, spatial interaction models, and not “just” about 2D spatial clustering.
How to find co-authors
Simon had presented his percolation analysis with the Atlas of Hillforts data set and explained he’d done the analysis in R. I pricked up my ears at the mention of R and asked him in the coffee break whether I might have his scripts so I could use his analysis on my data. Kindly he agreed. When I finally got around to trying the scripts, I managed to run a percolation analysis on my data after just a couple of hours. As my data set was so different from his (I was looking at features excavated along a narrow transect, he was analysing hillforts all over Britain), I asked him whether he’d be interested in writing a paper in which we discussed the method and how it can be applied to very different geographical scales. And he was!
So over the next 1 1/2 years we wrote our case studies, compared percolation analysis to other well-known clustering algorithms in archaeology, created functions out of the scripts, put the functions in a proper R package, tested, tested, tested, wrote some documentation, re-wrote the documentation, tested again and finally submitted the paper to the JCAA. The reviewers’ feedback was encouraging and very helpful in realising which parts of the package needed to be redone and which parts of the paper weren’t quite perfect yet. After some rewriting we finally published it open access in the Journal of the CAA in 2020. Yay!
A short summary of the percolation package
So, let me give you a short overview of our R package.
What is percolation?
Percolation analysis is a way to cluster spatially distributed points. Conceptually, it takes one point and a given radius around this point and checks which other points fall within this radius. These points are now “infected with belonging in this cluster” and a radius is drawn around each of them as well. Any point lying within one of those radii becomes part of the cluster, and so on…
Mathematically, we calculate the distance of all points to each other and save only those distances falling under a certain threshold (given by “limit” in the function). Saving the distances of all points to all points just leads to huge datasets and usually we don’t need the very large distances for defining clusters. In the next step we use the distances to “trace” the development of clusters through the data set. So, if the radius is set to, say hypothetically, 50 m, all points that are less than 50 m apart from each other will form a cluster.
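To make the idea concrete, here is a minimal sketch in base R of clustering at a single radius. It is not the code from our package: the toy coordinates and the helper cluster_at_radius() are made up for illustration, and it simply links every pair of points that are less than the radius apart and then follows those links.

```r
# Toy coordinates in metres (hypothetical data, not from the paper)
pts <- data.frame(
  PlcIndex = 1:6,
  Easting  = c(0, 30, 60, 200, 230, 500),
  Northing = c(0,  0,  0,   0,   0,   0)
)

# Hypothetical helper: give every point a cluster number for one radius
cluster_at_radius <- function(pts, radius) {
  d <- as.matrix(dist(pts[, c("Easting", "Northing")]))
  linked <- d < radius            # TRUE where two points are closer than the radius
  n <- nrow(pts)
  cluster <- rep(NA_integer_, n)  # NA means "not yet assigned"
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(cluster[start])) next
    current <- current + 1L
    cluster[start] <- current
    queue <- start
    # Spread the cluster number to all points reachable via short links
    while (length(queue) > 0) {
      i <- queue[1]
      queue <- queue[-1]
      neighbours <- which(linked[i, ] & is.na(cluster))
      cluster[neighbours] <- current
      queue <- c(queue, neighbours)
    }
  }
  cluster
}

cluster_at_radius(pts, 50)
# 1 1 1 2 2 3
```

Note that points 1 and 3 are 60 m apart, so they are not linked directly, but they still end up in the same cluster because both are linked to point 2. That chaining is the “infection” described above.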
The percolation function explained
Now the trick in the function is that you don’t give just one radius to try. By default you are asked to give a range of radii: an upper_radius, a lower_radius and a step_value will determine all the radii the function will run through. So if I give upper_radius = 100, lower_radius = 10 and step_value = 10, it will calculate radii of 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. Whether these are metres or kilometres is something you determine by giving unit = 1 (metre) or unit = 1000 (kilometre). The results, that is, which point belongs to which cluster at which radius, are saved in one large table.
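If you want to check beforehand which radii a given combination of values will produce, base R’s seq() gives the same series. This is only a sketch of the arithmetic, not a look inside the package; the variable names mirror the parameters above.

```r
upper_radius <- 100
lower_radius <- 10
step_value   <- 10

# All radii from the lower to the upper bound in steps of step_value
radii <- seq(from = lower_radius, to = upper_radius, by = step_value)
radii
# 10 20 30 40 50 60 70 80 90 100

length(radii)
# 10
```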
We can use this data set with the next functions, which I’ll explain in a moment. Just a short note: instead of letting our function calculate the distances between your data points (the input is a data frame with three columns: PlcIndex, Easting and Northing), you can also pass an already calculated distance table via distance_table = df. By default we set distance_table = NULL and let the algorithm do the calculation.
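As a rough illustration of the input and of what “keeping only the short distances” means, here is a small base R sketch. The toy coordinates are invented, and the column names from, to and distance are my own choice for this example; please check the help pages for the exact format the package expects for a pre-calculated distance table.

```r
# Hypothetical toy input with the three columns named in the text
sites <- data.frame(
  PlcIndex = c(1, 2, 3, 4),
  Easting  = c(0, 30, 60, 500),
  Northing = c(0, 40,  0,   0)
)

limit <- 110  # distances at or above this value are dropped

# All pairwise distances as a matrix, then one row per unordered pair
d <- as.matrix(dist(sites[, c("Easting", "Northing")]))
pairs <- which(upper.tri(d), arr.ind = TRUE)

distances <- data.frame(
  from     = sites$PlcIndex[pairs[, "row"]],
  to       = sites$PlcIndex[pairs[, "col"]],
  distance = d[pairs]
)

# Keep only the distances below the limit
short_distances <- distances[distances$distance < limit, ]
short_distances
```

Of the six possible pairs, only the three between sites 1, 2 and 3 survive; every distance involving the far-away site 4 is discarded, which is what keeps the table small on real data.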
All in all the function will look like:
percolate(data, distance_table = NULL, upper_radius = 100, lower_radius = 10, step_value = 10, limit = 110, unit = 1)
Of course we now want to see what these clusters look like! So we created the mapping function. Because we believe every map should have a title and a note on where the data come from, we included two extra parameters next to the shapefile that is used as a background map. You can also set the dpi, which is 300 by default:
mapClusters(shape = background_map, source_file_name = "The source data is from xx", map_file_name = "Mapping archaeological sites in 'insert name here'")
This produces a) one map that plots all points, just for you to check, and b) one map for every radius you’ve set. In the example this means 11 maps are printed altogether. That’s great: now we can look at the clusters and develop hypotheses about what they might mean, whether they correlate with spatial features, and so on…
I show Simon’s results here, because they are much easier to recognise than mine: everyone knows Great Britain, but I guess very few people know about the road B6n near Köthen in Saxony-Anhalt… 😉
Analysing the results
We also included another analysis tool: a function that creates three plots, each with the radius on the x-axis and one of the following on the y-axis:
- a) the mean cluster size
- b) the maximum cluster size
- c) the normalised maximum cluster size.
These can be used to discuss why a particular radius might lead to the clusters you are looking for. The function is fairly simple and doesn’t need the source_file_name if you’ve already run the mapClusters function, because that value will have been stored.
PlotClustFreq(source_file_name = "The source data is from xx")
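To show what goes into such plots, here is a minimal base R sketch that computes the three values per radius from a small, made-up membership table. The table layout and column names are hypothetical, single points are counted as clusters of size one, and “normalised” is assumed here to mean the maximum cluster size divided by the total number of points; the package may handle these details differently.

```r
# Hypothetical membership table: one row per point and radius
membership <- data.frame(
  radius   = rep(c(10, 20, 30), each = 6),
  PlcIndex = rep(1:6, times = 3),
  cluster  = c(1, 1, 2, 3, 4, 5,   # radius 10
               1, 1, 1, 2, 2, 3,   # radius 20
               1, 1, 1, 1, 1, 2)   # radius 30
)

n_points <- length(unique(membership$PlcIndex))

# One row of summary statistics per radius
stats <- do.call(rbind, lapply(split(membership, membership$radius), function(m) {
  sizes <- as.vector(table(m$cluster))  # number of points in each cluster
  data.frame(
    radius        = m$radius[1],
    mean_size     = mean(sizes),
    max_size      = max(sizes),
    max_size_norm = max(sizes) / n_points  # assumption: share of all points
  )
}))
stats
```

Each of the three value columns can then be plotted against radius. A sudden jump in the maximum cluster size from one radius to the next is the kind of transition that makes a radius worth a closer look.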
For illustration purposes we added maps showing the clusters at relevant percolation radii to these plots:
In the paper Simon shows that at some percolation radii very interesting clusters emerge that coincide with geographical structures. I look at how close features lie to each other in my road excavation data set and find distances similar to those documented in large-scale excavations. We therefore believe that percolation analysis is a useful approach at very different scales. Also, in comparison to other clustering algorithms, it has a simple mathematical core and is easy to understand. That is useful for archaeologists, who need to be able to explain why certain features emerge in their analyses. It’s now easily applicable, if you use R at least, and a worthwhile addition to the archaeological toolkit!
Wanna try it?
In the R-package you will find a vignette and help pages for all functions, for the package itself and the example data set we gave. You can find the package here, download, install and try it yourself! I’d be delighted to hear of your work with the package, whether everything runs smoothly and what you think we could improve!
Percolation is cool, R is cool, take a look at this cool new package we made! 😉
Actually, the most important point is: talk to people at conferences. No, really. Just chat with them. If I hadn’t dared to chat with Simon and ask for his code, I wouldn’t have got to know such a kind and generous person, I wouldn’t have learned how to make R packages, and I would have one article fewer on my publication list. And that list is ranked by importance.