Where do we draw the line?

Using data science to explore diversity of handwriting styles

Have you ever wondered how many different ways people might write or express the same sentence? Handwriting has been used for centuries as a means of communication and a physical way of expressing thoughts. Although written communication today is often done digitally, scholars still point to the many benefits of physical handwriting for cognitive development, for literacy, and for our ability to create mental pictures of the world. In fact, recent evidence suggests that there may even be a relationship between our handwriting style and key personality traits.

At the Turing, while we’ve been virtually connecting throughout lockdown, we took a moment to disconnect from our keyboards to explore our diverse handwriting styles. We are proud to have taken part in National Inclusion Week from 28 September to 4 October 2020, an event created by Inclusive Employers to celebrate diversity and raise awareness of inclusion in the workplace.

During the week, we collected handwriting samples of the phrase "Each One Reach One", the theme of the week, from across the Turing. We were interested in exploring whether we could visualise the diverse set of handwriting styles that might exist across the Institute.

A data science spin on National Inclusion Week: using autoencoders and dimensionality reduction techniques to visualise a web of diverse handwriting styles at the Turing.

Teaching machines to read our writing

Humans can naturally identify and recognise distinct styles and components of art, such as colour and pattern. Our handwriting consists of a multitude of unique characteristics that we can recognise with ease when reading another person's writing.

However, when we ask machines to do this digitally, things become more complicated. Interpreting style from merely a grid of pixels is not a straightforward task. How can we teach machines to recognise style, so that we can explore how diverse our handwriting styles really are?

This brings us to recent advancements in deep learning, in particular convolutional autoencoders, which can learn to efficiently extract and encode visual features of an image in an unsupervised manner. They achieve this by attempting to reconstruct the original image from a compact representation they learn.

For this reason, they are often used in applications such as denoising and compression of images. In our case, we train a convolutional autoencoder to learn a low-dimensional representation of someone's handwriting from an input image of the same phrase (“Each One Reach One”).

Example handwriting samples of National Inclusion Week’s 2020 theme, “Each One Reach One”, written in three different styles by members of The Alan Turing Institute community

Sharing the same phrase should force the model to focus on stylistic features, such as the slant of the writing, the spacing or the letter casing. That is, how it is written rather than what is written.

If the model can successfully reconstruct someone’s handwriting, then it might have learned informative characteristics that underpin style. We can then use these representations as a proxy for handwriting style. What does this representation look like? The model learns to compress each image into a fixed-size, 48-dimensional vector, and to decompress that vector back into the original image.
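For the technically curious, here is a minimal sketch of what such a convolutional autoencoder could look like, written in Keras. The architecture, layer sizes and training settings are illustrative assumptions rather than the exact model we used; only the input size (100 x 400 pixel greyscale images) and the 48-dimensional bottleneck follow the description in this post.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input: greyscale handwriting images, 100 x 400 pixels, scaled to [0, 1].
inputs = layers.Input(shape=(100, 400, 1))

# Encoder: convolutions and pooling progressively shrink the image,
# then a dense layer squeezes it into a 48-dimensional "style code".
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)                   # -> 50 x 200
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)                   # -> 25 x 100
x = layers.Flatten()(x)
code = layers.Dense(48, name="style_code")(x)   # the compact representation

# Decoder: reconstruct the original image from the 48-dimensional code.
x = layers.Dense(25 * 100 * 32, activation="relu")(code)
x = layers.Reshape((25, 100, 32))(x)
x = layers.UpSampling2D(2)(x)                   # -> 50 x 200
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                   # -> 100 x 400
outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = models.Model(inputs, outputs)
encoder = models.Model(inputs, code)  # used later to extract style vectors

# Unsupervised training: the input image is also the target.
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(images, images, epochs=50, batch_size=16)
```

The key detail is that the model is trained to reproduce its own input, so the only supervision it needs is the images themselves.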

Having obtained a representation of each handwriting sample in a way that might encode its style, we can now explore how diversely a single phrase can be written. Alas, even though our encoded images are far smaller than the originals (48 values versus 100 x 400 = 40,000 pixels), they are still too high-dimensional to visualise on a 2-dimensional plot. To solve this, we use two dimensionality reduction algorithms, UMAP and t-SNE, to project the representations of style onto a 2-dimensional plot whilst approximately preserving the distances between samples (and hence styles) in the learned representation.

Each point in the visualisation is therefore a sample of handwriting that has been encoded by the model and positioned by UMAP or t-SNE based on its distance to the other samples. Styles that share features are positioned closer together on the plot. Each algorithm has unique properties, which may result in different groupings and positioning of styles. For example, UMAP attempts to preserve global structure more strongly than t-SNE (e.g., the distances between separate clusters of points, in addition to the distances between neighbouring points).
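Continuing the sketch above, the projection step might look like the following, using the umap-learn and scikit-learn libraries (the parameter values are illustrative defaults, not our tuned settings):

```python
from sklearn.manifold import TSNE
import umap  # the umap-learn package

# codes: one 48-dimensional style vector per handwriting sample,
# extracted with the encoder from the sketch above (shape: n_samples x 48).
codes = encoder.predict(images)

# Project the 48-dimensional codes down to 2 dimensions for plotting.
umap_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(codes)

# t-SNE's perplexity must be smaller than the number of samples.
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(codes)
```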

Connecting the dots

Finally, to aid visual interpretation, the points have also been coloured according to the groups found by K-means, a common clustering algorithm used to automatically group nearby data points. The code for the visualisation is available on GitHub.
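The colouring step could be sketched as follows; the number of clusters here is an arbitrary illustrative choice, not the value used in our visualisation:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Group the projected points into clusters purely to assign colours.
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(umap_2d)

plt.scatter(umap_2d[:, 0], umap_2d[:, 1], c=labels, cmap="tab10", s=20)
plt.title("Handwriting styles of 'Each One Reach One' (UMAP projection)")
plt.show()
```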

You might notice some familiar fonts in the visualisation too, such as Times New Roman and Arial, which we included to showcase computerised styles alongside human ones.

So, what does the plot show? We observe a cloud of different handwriting styles, each with a unique place on the plot (hover over the points to see the original handwriting). Although in this analysis we don't know exactly which features the model treats as informative about style, the distances between points suggest that these might include the curliness or thickness of the writing, for example.

It’s been great to put a data science spin on National Inclusion Week and to learn more about our colleagues. I'd like to thank my wonderful colleagues, in particular Khanisa Riaz, for their helpful comments and suggestions on the visualisation. This exploration also reflects the Turing’s approach to projects, where everyone is encouraged to bring their own style. Despite all participants writing the same phrase, it is exciting to see how diverse the styles of handwriting across the Turing are!