For the last two years I’ve been developing a tutorial on clustering methods in archaeology with some colleagues. It’s published now!
Classification methods are really important in archaeology. With the help of computers we can compare and group objects based on a multitude of descriptive features (attributes). At the winter school KlassifikatoR in Kiel, we talked about different distance measures, clustering methods and a great validation algorithm. From the beginning, the aim of the winter school was to create teaching material for others. Now, two years later and written entirely during a worldwide pandemic, we have finally decided that the tutorial is *fine* and released it. You can download the html via this DOI: 10.5281/zenodo.6325372.
A quick run through the tutorial
Let me give you a quick tour through the main points of the tutorial. The tutorial is aimed at archaeologists with some R skills who want to learn more about classification. But even if you want to do a cluster analysis in different software, it is still for you! We really dive into the background of the methods, and these explanations will (we hope) be helpful to anyone.
In this part we introduce what classification is and why it is important to archaeology. We show the different ways to classify data in general and define the underlying terms. It’s the theoretical background for what follows. I recommend reading it first, especially if you are unfamiliar with some of the concepts.
Distance measures are very important, because your whole classification relies on how you think about similarity and dissimilarity (“distance”). There are different ways to measure distance, depending on the scales of measurement of your data. Also, consider this: are two objects more similar if they share the absence of a feature? Or is the absence of a feature unimportant? There are ways to deal with questions like this, and you should always be able to explain why you chose a certain distance measure. But don’t worry! The tutorial has got you covered with easy-to-understand archaeological examples.
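To make the shared-absence question concrete: two classic coefficients for presence/absence data treat it differently. The tutorial itself works in R, but the idea is language-agnostic; here is a toy sketch in Python with made-up feature lists for two pots (not an example from the tutorial).

```python
# Two pots described by presence (1) / absence (0) of decorative features.
# The simple matching coefficient counts shared absences as similarity;
# the Jaccard coefficient ignores them.

def simple_matching_distance(a, b):
    # fraction of features where the two objects disagree
    matches = sum(x == y for x, y in zip(a, b))
    return 1 - matches / len(a)

def jaccard_distance(a, b):
    both = sum(x and y for x, y in zip(a, b))    # present in both
    either = sum(x or y for x, y in zip(a, b))   # present in at least one
    return 1 - both / either

# the last five features are absent in both pots
pot1 = [1, 1, 0, 0, 0, 0, 0, 0]
pot2 = [1, 0, 1, 0, 0, 0, 0, 0]
print(simple_matching_distance(pot1, pot2))      # → 0.25 (shared absences count)
print(round(jaccard_distance(pot1, pot2), 2))    # → 0.67 (they don't)
```

The same two pots look quite similar under simple matching and quite different under Jaccard, which is exactly why the choice of distance measure needs justification.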
There is a variety of clustering methods. In the tutorial, we focus on three: hierarchical clustering, k-means and HDBSCAN.
Hierarchical clustering gives you “cluster trees” that show how clusters relate to each other hierarchically. The trees depend on how you compare distances between clusters (the “linkage”). A hierarchical cluster tree has not yet “decided” which clusters to form: you can derive different groupings from your data with the given information. Sometimes it can be difficult to decide where to “cut the tree” and determine the number of clusters that fits your narrative best.
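The agglomerative idea behind those trees is simple: start with every object in its own cluster and repeatedly merge the two closest ones. As a language-agnostic illustration (the tutorial uses R), here is a minimal pure-Python sketch of single linkage, where the distance between two clusters is the distance between their closest members; the points are invented.

```python
# Minimal single-linkage agglomerative clustering sketch.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the *closest* pair of members
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), round(d, 2)))
        clusters[j] = clusters[i] + clusters[j]
        del clusters[i]
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for left, right, height in single_linkage(points):
    print(left, "+", right, "at height", height)
```

The merge heights are exactly what a dendrogram plots on its vertical axis; “cutting the tree” at some height keeps only the merges below it, which is why the same tree can yield different numbers of clusters.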
k-means works differently: here you need to set beforehand how many clusters (k) you want to find in your data. An iterative process then lets the algorithm find the best way to fit your data into this number of categories. This often leads to “spherical” clusters. And, again, it is often difficult to decide on the number of clusters beforehand.
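That iterative process alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A toy Python sketch (again, not the tutorial’s R code; data and k are invented):

```python
# Toy k-means sketch: assign points to the nearest of k centroids,
# recompute the centroids as cluster means, repeat until stable.
import random

def kmeans(points, k, seed=42):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive initialisation: random points
    assignment = None
    while True:
        # assignment step: nearest centroid by squared Euclidean distance
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            return centroids, assignment
        assignment = new_assignment
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
```

Because each centroid attracts the points nearest to it, the resulting clusters tend to be compact and roughly round, which is the “spherical” tendency mentioned above.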
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) deals with this problem in a different way. Here you don’t set the number of clusters, but the minimum number of points a cluster must contain. The algorithm constructs a kind of cluster tree and checks which of the clusters are most stable; these are then chosen to represent your data.
So, as you can see, these are quite different approaches. We chose them because they are the ones best known in archaeology. There are others as well; see e.g. my post on percolation. If you apply more than one clustering approach, you may well find different groups in your data. To help you decide which method might be best for your data, we explain a validation method in the tutorial:
There are different validation methods out there, but the one we feature in the tutorial has been shown to work best: the silhouette method.
The silhouette method lets you compare how well different groupings cluster the data and how well a single point fits into the cluster it has been assigned to. It is based on calculating the mean distance of a point to all other members of its own cluster, as well as its mean distance to the members of the nearest other cluster. By comparing these values with a little bit of math, you can determine whether a point is assigned well to its cluster, lies right between two clusters, or should probably be in another cluster.
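The “little bit of math” is just (b − a) / max(a, b), where a is the mean distance to the point’s own cluster and b the mean distance to the nearest other cluster. A small Python sketch with invented points and labels (the tutorial does this in R):

```python
# Silhouette value for a single point:
#   a = mean distance to the other members of its own cluster
#   b = mean distance to the members of the nearest other cluster
#   s = (b - a) / max(a, b)
# s near 1: well assigned; near 0: between two clusters; negative: misassigned.

def dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def silhouette(i, points, labels):
    own = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
    a = sum(dist(points[i], points[j]) for j in own) / len(own)
    b = min(
        sum(dist(points[i], points[j]) for j in other) / len(other)
        for other in (
            [j for j, l in enumerate(labels) if l == lab]
            for lab in set(labels) - {labels[i]}
        )
    )
    return (b - a) / max(a, b)

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(round(silhouette(0, points, labels), 2))  # → 0.87, a good fit
```

Averaging these values over all points gives a single score per clustering, which is what lets you compare, say, a k-means result against a hierarchical one.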
It probably sounds more complicated than it is, as you will learn if you follow our tutorial. 😉
Open Educational Resources
I’m very happy we decided to put this tutorial online so it is accessible to everyone. We wrote it in Rmarkdown (check out the tutorial gitlab space), but created an html file for you to download and open in a browser. The html is interactive and full of internal links, which let you jump back and forth, e.g. to the introductory explanations to refresh your memory. We also included graphs that show the workflows we explain; a workflow will help you decide which steps to take with your kind of data.
We uploaded the html and the compendium to zenodo to get a stable DOI. Here is the link to the zenodo repo: https://doi.org/10.5281/zenodo.6325372. This means the tutorial is published in a stable way and you can properly cite it. Nonetheless, it’s under an open license, so everyone is invited to re-use it. Just, please, cite us as:
Schmidt, Sophie C., Martini, Sarah, Staniuk, Robert, Quatrelivre, Carole, Hinz, Martin, Nakoinz, Oliver, Bilger, Michael, Roth, Georg, & Laabs, Julian. (2022, March 3). Tutorial on Classification in Archaeology: Distance Matrices, Clustering Methods and Validation. Zenodo. https://doi.org/10.5281/zenodo.6325372
Thanks so much to all my co-authors. It was a joy working with you! Also a huge thank you to those who read the tutorial before release and gave us invaluable hints on what to improve!
We would still love some feedback! If you try our tutorial and have an opinion, please open an issue in the gitlab, write an email to any of us, or comment here.