NetAudit – Interpretable Embeddings for the Dutch Population Network

22 April 2025

Written by Megha Khosla and Malte Luken

Network analysis is an increasingly vital tool in the social sciences. It enables researchers to study how information, behaviours, and attitudes spread through social structures. At the core of this approach is the idea of social opportunity, which we define as the potential for individuals to form relationships based on their shared contexts. Statistics Netherlands provides a unique and powerful resource for such analysis: a full-scale population network of the Netherlands[1]. This network includes contextual links such as neighbours, colleagues, classmates, family members, and household members, offering researchers insight into the opportunity for interaction and not necessarily the interaction itself.

In parallel, machine learning has introduced tools like embeddings to represent complex data (such as text or networks) as low-dimensional numeric vectors. These embeddings can capture meaningful patterns and are commonly used for tasks such as similarity search or attribute prediction. In the NetAudit project, we bring these two worlds together by learning embeddings for the entire Dutch population network. In this model, each individual is represented by a D-dimensional vector that reflects their structural position within the national social network.
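To make the idea of similarity search concrete, here is a minimal sketch of how one might compare embedding vectors with cosine similarity. The names, the four dimensions, and the values are invented for illustration. They are not taken from the actual population network embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name, embeddings):
    """Return the other individual whose embedding is closest to `name`."""
    query = embeddings[name]
    others = (other for other in embeddings if other != name)
    return max(others, key=lambda other: cosine_similarity(query, embeddings[other]))


# Hypothetical 4-dimensional embeddings for three made-up individuals.
embeddings = {
    "alice": [0.9, 0.1, 0.0, 0.3],
    "bob": [0.8, 0.2, 0.1, 0.4],
    "carol": [-0.5, 0.7, 0.6, -0.2],
}

print(most_similar("alice", embeddings))
```

In this toy example, the vectors for Alice and Bob point in nearly the same direction, so Bob is returned as the nearest neighbour of Alice.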

Figure 1 illustrates this idea in action. Just as natural language models learn to embed words based on the context they appear in, network embedding techniques [2] embed individuals based on their “network context”: the people they are connected to via various social opportunities. These embeddings allow us to do more than just visualise structure. They provide the foundation for predicting meaningful attributes, such as unemployment risk or educational attainment.

Figure 1: Network embedding techniques learn vector representations for nodes (here Bob and Alice) based on their positions in the network.

However, one challenge remains: interpretability. Unlike traditional social science variables, embedding dimensions often lack clear meaning. To address this, we applied a transformation that makes the dimensions sparse and orthogonal, encouraging them to capture distinct and interpretable aspects of the population network. This makes the embeddings more useful not only for prediction tasks, but also for exploratory research and hypothesis generation.

In this blog post, we briefly explain how we created and transformed the embeddings. We conclude by discussing the potential of the embeddings for social science research. The untransformed and transformed population network embeddings are available for the years 2020, 2021, and 2022 within the secure remote access environment of Statistics Netherlands through the Storage Facility (in collaboration with ODISSEI). The code to create and transform the embeddings is available on GitHub.

Creating Population Network Embeddings

We used two different approaches to create the node embeddings. The first approach is called DeepWalk [3] and borrows two ideas from embedding methods for text. The first idea is that we can randomly draw sequences of connected nodes from a network and treat them like sentences in a text. These sequences are called random walks. The second idea is that if we train a shallow neural network with an embedding layer to predict nodes based on their surrounding nodes in a random walk, the embedding layer will encode each node's position in the network. This means that neighboring nodes, as well as nodes with a similar neighborhood structure, will have similar embeddings. In contrast, nodes in segregated parts of the population network will have different embeddings.
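The random-walk idea can be sketched in a few lines. The example below is illustrative only: the four-person network, the walk length, the window size, and the function names are assumptions, and the project's actual implementation may differ. It draws uniform random walks and then lists the (node, surrounding node) pairs on which a shallow network with an embedding layer would be trained.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Draw one uniform random walk of up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbours))
    return walk


def context_pairs(walk, window):
    """Pair every node in a walk with the nodes at most `window` steps away."""
    pairs = []
    for i, centre in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, walk[j]))
    return pairs


# Hypothetical toy network; every link is listed in both directions.
adjacency = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave": ["carol"],
}

rng = random.Random(0)  # fixed seed so the walks are reproducible
walks = [random_walk(adjacency, node, 5, rng) for node in adjacency for _ in range(3)]
pairs = [pair for walk in walks for pair in context_pairs(walk, window=2)]
```

The walks play the role of sentences and the pairs play the role of word-context pairs. In practice, they would be passed to a word-embedding style model rather than inspected by hand.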

The second approach is called Large-scale Information Network Embedding (LINE) [4]. It uses a shallow neural network with two different embedding layers. The first layer contains embeddings that reflect the proximity to other nodes in the population network (first-order embeddings). This means that neighboring nodes have similar first-order embeddings. The second embedding layer represents the context similarity in the network (second-order embeddings). Consequently, nodes with a similar neighborhood structure (for example, a similar number of neighbors) have similar second-order embeddings.
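The difference between the two kinds of proximity can be illustrated with the probabilities that the LINE paper [4] defines for them. The sketch below only evaluates these probabilities for made-up two-dimensional vectors and does not train anything. The dictionaries `node_emb` and `context_emb` and all values are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def first_order_prob(u_i, u_j):
    """First-order proximity: sigmoid of the dot product of two node vectors."""
    return 1.0 / (1.0 + math.exp(-dot(u_i, u_j)))


def second_order_prob(i, j, node_emb, context_emb):
    """Second-order proximity: softmax probability that j is a context of i."""
    scores = {k: math.exp(dot(context_emb[k], node_emb[i])) for k in context_emb}
    return scores[j] / sum(scores.values())


# Hypothetical embeddings: one vector per node, plus one "context" vector per node.
node_emb = {"alice": [0.5, 0.1], "bob": [0.4, 0.2], "carol": [-0.3, 0.6]}
context_emb = {"alice": [0.2, 0.3], "bob": [0.6, -0.1], "carol": [-0.4, 0.5]}

p_first = first_order_prob(node_emb["alice"], node_emb["bob"])
p_second = second_order_prob("alice", "bob", node_emb, context_emb)
```

Training would adjust the vectors so that these probabilities are high for pairs that are actually linked. The first-order objective pulls direct neighbors together, whereas the second-order objective makes nodes similar when they share contexts.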

While both methods yield similar embeddings, separating proximity and context similarity in the LINE embeddings can be beneficial for some applications.

Making Embedding Dimensions More Interpretable

In network embedding models, each node’s position in the network is encoded as a vector of numbers, its embedding. However, the information about a node’s structural role is typically distributed across all embedding dimensions, meaning that individual dimensions are not directly interpretable. For example, we cannot assume that one dimension corresponds to geographical location: people living in the North of the Netherlands won’t necessarily have high values in the same embedding coordinate.

To improve interpretability, we applied a method inspired by the Dimensional Interpretability of Node Embeddings (DINE) approach [5]. This involves training a denoising autoencoder that transforms the embeddings to make the dimensions sparse and orthogonal. In simple terms, this transformation encourages the network to “push” distinct types of structural information into different dimensions. As a result, one dimension might predominantly capture geographic clustering, while another might reflect household or work-based relationships.
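As a rough illustration of what “sparse and orthogonal” means in practice, the sketch below evaluates a loss for a single-layer denoising autoencoder with two extra penalty terms. This is not the DINE objective from [5] or the project's code. The architecture, the specific penalty terms, the weights, and all names are assumptions chosen to keep the example short.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def orthogonality_penalty(codes):
    """Mean squared deviation of the dimensions' Gram matrix from the identity."""
    norms = np.linalg.norm(codes, axis=0, keepdims=True)
    unit = codes / np.maximum(norms, 1e-12)  # unit-length columns (dimensions)
    gram = unit.T @ unit
    return float(np.mean((gram - np.eye(gram.shape[0])) ** 2))


def autoencoder_loss(X, W_enc, W_dec, rng, noise_std=0.1, l1_weight=0.01, ortho_weight=0.1):
    """Reconstruction error plus sparsity and orthogonality penalties (illustrative)."""
    X_noisy = X + rng.normal(0.0, noise_std, size=X.shape)  # denoising: corrupt the input
    codes = sigmoid(X_noisy @ W_enc)                        # transformed embeddings
    X_hat = codes @ W_dec                                   # reconstruct the clean input
    reconstruction = float(np.mean((X - X_hat) ** 2))
    sparsity = float(np.mean(np.abs(codes)))                # L1 term favours few active dimensions
    return reconstruction + l1_weight * sparsity + ortho_weight * orthogonality_penalty(codes)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))               # stand-in for the original embeddings
W_enc = rng.normal(scale=0.1, size=(8, 4))  # untrained encoder weights
W_dec = rng.normal(scale=0.1, size=(4, 8))  # untrained decoder weights
loss = autoencoder_loss(X, W_enc, W_dec, rng)
```

During training, the weights would be updated by gradient descent to minimise such a loss. The orthogonality term is zero only when no two dimensions overlap, which is what pushes different kinds of structural information into different dimensions.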

However, the transformation does not guarantee that any single dimension corresponds to a known real-world variable. The dimensions must be interpreted post hoc, for example by computing correlations between embedding values and known node-level attributes such as age, occupation, or location.
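A post hoc check of this kind could look like the following sketch, which computes the Pearson correlation between each embedding dimension and one attribute. The data are synthetic: the `age` variable and the embedding matrix are randomly generated, and the first dimension is constructed to track age so that there is something to find.

```python
import numpy as np


def correlate_dimensions(embeddings, attribute):
    """Pearson correlation of every embedding dimension with one attribute."""
    centred_emb = embeddings - embeddings.mean(axis=0)
    centred_attr = attribute - attribute.mean()
    covariance = centred_emb.T @ centred_attr
    scale = np.linalg.norm(centred_emb, axis=0) * np.linalg.norm(centred_attr)
    return covariance / scale


# Synthetic stand-in data: 500 individuals, 4 embedding dimensions.
rng = np.random.default_rng(0)
age = rng.normal(45, 15, size=500)
embeddings = rng.normal(size=(500, 4))
embeddings[:, 0] = 0.05 * age + rng.normal(scale=0.1, size=500)  # dimension 0 tracks age

corr = correlate_dimensions(embeddings, age)
best_dimension = int(np.argmax(np.abs(corr)))
```

With real embeddings, one would repeat this for several attributes and inspect which dimensions, if any, stand out. A high correlation suggests an interpretation for a dimension but does not establish one.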

The Potential of Population Network Embeddings

We believe that interpretable embeddings of the Dutch population network have great potential for social science research, and we want to encourage researchers to use them. In the NetAudit project, we have adopted right-wing populist voting behavior as a use case (we are still in the process of preparing a manuscript and hope to share a preprint soon). By linking the network embeddings with survey responses from the LISS panel data, we investigate whether embeddings predict right-wing populist voting and whether they contain information beyond socio-economic status and personality traits. Besides using the embeddings to predict individual-level outcomes, other interesting applications might be to look at the change of specific embedding dimensions over time or use them to construct novel communities.

References

[1] J. Van der Laan, E. de Jonge, M. Das, S. Te Riele, and T. Emery, “A whole population network and its application for the social sciences,” European Sociological Review, vol. 39, no. 1, pp. 145–160, 2023.

[2] M. Khosla, V. Setty, and A. Anand, “A comparative study for unsupervised network representation learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 5, pp. 1807–1818, 2019.

[3] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, New York, USA: ACM, Aug. 2014, pp. 701–710. doi: 10.1145/2623330.2623732.

[4] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large-scale Information Network Embedding,” in Proceedings of the 24th International Conference on World Wide Web, Florence, Italy: International World Wide Web Conferences Steering Committee, May 2015, pp. 1067–1077. doi: 10.1145/2736277.2741093.

[5] S. Piaggesi, M. Khosla, A. Panisson, and A. Anand, “DINE: Dimensional interpretability of node embeddings,” IEEE Transactions on Knowledge and Data Engineering, 2024. Available: https://ieeexplore.ieee.org/abstract/document/10591463/

Picture by Planet Volumes For Unsplash+