XDM Data Orchestration: Anonymization

This guide outlines best practices for GDPR- and BDSG-compliant modification and anonymization of data fields within XDM. It includes practical recommendations for building lookup tables, preparing data processing workflows, and handling error scenarios effectively.

Recommendations for Planning and Implementation Decisions

From a data protection perspective, it is generally recommended to anonymize all fields containing personal data when creating test datasets. Ideally, such data should not be included in test environments in order to meet data minimization and privacy requirements.

As a guideline, define only as many content rules for anonymization as are needed for meaningful test operation. In particular, apply masking carefully and with as few constraints as possible. The aim is to provide effective data protection while still satisfying the application's rules and keeping the data recognizable enough for testing.

Overly granular masking that tries to cover as many special cases as possible can result in a masking project in which complexity and effort increase disproportionately to the quality of the masking. Edge cases can be handled separately through targeted test datasets, without being embedded in the full anonymization rule set.

Step 1: Defining the Anonymization Strategy by Attribute Class

The first step involves working with attribute classes, for example name, street, city, or case number. For each attribute class, a fundamental decision should be made regarding the target values to be used during anonymization. The chosen target model should strike a balance between the desired level of complexity and the needs of the test data. It is important to retain only as many real-world cases and data characteristics from production as are necessary for meaningful testing.

The more constraints defined per attribute class (e.g., gender-preserving or language-consistent name replacement, rules on the number of name components), the more complex the implementation tends to become. This applies both to general development and to handling individual attribute instances — especially when contextual information required for decision-making may be missing.

Decisions made during this step are global in nature and have a significant impact on the overall effort required for the anonymization project. It is therefore recommended to make these fundamental decisions as early as possible in the project lifecycle. However, some rule implications may only become fully apparent later — for example, when additional attribute instances are identified as the project progresses.

Step 2: Implementation at the Level of Individual Attribute Instances

In practical implementation, it is advisable to apply the following decision logic for each attribute instance:

For attributes that appear centrally in the test data (e.g., in identifying interfaces and outputs), the full anonymization rule set should be applied.
For instances directly linked to central attributes (such as subviews or detail views), values should be changed consistently to maintain a coherent and traceable data basis.
For peripheral instances (e.g., automatically generated documents or repeated entries in summaries), the effort required for consistent replacement should be assessed on a case-by-case basis. If the contextual information needed for consistent masking is unavailable or difficult to obtain, it may be appropriate to either delete the affected values (e.g., replace with an empty string) or substitute them with neutral placeholders (e.g., XXX).
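The decision logic above can be sketched roughly as follows. This is an illustrative Python sketch, not XDM configuration: the names `InstanceKind` and `mask_instance` are hypothetical, and `replacements` is assumed to be a mapping already produced by the full anonymization rule set for the central attribute.

```python
from enum import Enum


class InstanceKind(Enum):
    CENTRAL = "central"        # e.g. identifying interfaces and outputs
    LINKED = "linked"          # e.g. subviews or detail views
    PERIPHERAL = "peripheral"  # e.g. generated documents, summaries


def mask_instance(kind, value, replacements, placeholder="XXX"):
    """Return the masked value for one attribute instance.

    `replacements` maps original values to anonymized substitutes
    (assumed to come from the full rule set for the central attribute).
    """
    if kind in (InstanceKind.CENTRAL, InstanceKind.LINKED):
        # Central and linked instances must stay consistent with each
        # other, so a missing replacement is treated as an error.
        return replacements[value]
    # Peripheral instances: reuse the replacement when it is known,
    # otherwise fall back to a neutral placeholder (or "" to delete).
    return replacements.get(value, placeholder)
```

Passing `placeholder=""` corresponds to deleting the value instead of substituting a neutral placeholder.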

This approach enables a fast and practical initial masking of data for productive use in the test environment. If later testing reveals that more consistent masking is necessary in specific areas, it is recommended to refine the masking logic accordingly at those points.

This iterative method typically leads to a reliable and GDPR-compliant state of test data relatively quickly, and allows for ongoing improvement step by step. It is usually more effective than relying on non-anonymized data during development or forgoing test cases altogether.

1. Building Lookup Tables with XDM

To anonymize sensitive data such as names or addresses, lookup tables are used. This allows real values to be replaced with meaningful but anonymized substitutes.

Lookup tables should be structured based on relevant criteria (e.g., country, gender). Each category should contain unique, sequentially numbered replacement values.

Table 1. Example Lookup Table for First Names
Country	Gender	Number	First name
Germany	male	1	Jonas
Germany	male	2	Leon
Germany	female	1	Anna
Germany	female	2	Marie
Netherlands	male	1	Noah
Netherlands	female	1	Emma

Error case: If no name is available for Germany, non-binary, a default value such as Alex is used.
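As an illustration of how such a lookup might behave, the following Python sketch mirrors Table 1. It does not use any XDM API; the function name `lookup_first_name` and the wrap-around behavior for numbers beyond the available entries are assumptions made for this example.

```python
# Replacement values per (country, gender) category, in sequence order.
FIRST_NAMES = {
    ("Germany", "male"): ["Jonas", "Leon"],
    ("Germany", "female"): ["Anna", "Marie"],
    ("Netherlands", "male"): ["Noah"],
    ("Netherlands", "female"): ["Emma"],
}


def lookup_first_name(country, gender, number, default="Alex"):
    """Return replacement `number` (1-based) for the given category.

    If the category has no entries, the default value is returned.
    """
    candidates = FIRST_NAMES.get((country, gender))
    if not candidates:
        return default
    # Assumption: numbers beyond the last entry wrap around to the start.
    return candidates[(number - 1) % len(candidates)]
```

For example, `lookup_first_name("Germany", "male", 2)` selects the second German male entry, while a category without entries falls back to the default.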

2. Collecting Data Before Copying

To efficiently use various pieces of information for later anonymization, required attributes should first be collected in an intermediate table.

The data should be assembled using a unique key (e.g., Personal ID). The intermediate table holds all relevant information.

Table 2. Example Intermediate Table
Personal ID	Gender	Country of Birth
4711	male	Germany
2455	female	Netherlands
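A minimal sketch of assembling such an intermediate table, assuming the attributes come from two separate sources that share the Personal ID as key. The function name `build_intermediate_table` and the source variables are hypothetical; in XDM itself this step would be part of the data processing workflow.

```python
def build_intermediate_table(*sources):
    """Merge per-person attribute dicts from several sources by Personal ID."""
    table = {}
    for source in sources:
        for personal_id, attributes in source.items():
            # Create the row on first sight, then add this source's columns.
            table.setdefault(personal_id, {}).update(attributes)
    return table


# Example data taken from Table 2, split across two assumed sources.
genders = {4711: {"gender": "male"}, 2455: {"gender": "female"}}
birth_countries = {
    4711: {"country_of_birth": "Germany"},
    2455: {"country_of_birth": "Netherlands"},
}

intermediate = build_intermediate_table(genders, birth_countries)
```

Each row of `intermediate` then holds everything a later lookup needs, so the anonymization step does not have to query the sources again.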

3. Anonymizing Data in a Dependency-Aware Sequence

Some columns depend on each other for anonymization. Always anonymize the base column first; dependent fields can then be populated from the anonymized base value.

Table 3. Example of Dependent Anonymization
First name (original)	First name (anonymized)	Display name (anonymized)
Anna	Emma	Emma Jansen
Jonas	Leon	Leon Musterfrau
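The ordering can be sketched as follows. This is an illustrative example only: the function `anonymize_person`, the column names, and the original last name "Meier" are assumptions that do not appear in the table above.

```python
def anonymize_person(row, first_name_map, last_name_map):
    """Anonymize the base columns first, then rebuild the display name."""
    result = dict(row)
    # Step 1: base columns.
    result["first_name"] = first_name_map[row["first_name"]]
    result["last_name"] = last_name_map[row["last_name"]]
    # Step 2: the dependent column is derived from the anonymized base
    # values and is never anonymized independently.
    result["display_name"] = f'{result["first_name"]} {result["last_name"]}'
    return result


row = {"first_name": "Anna", "last_name": "Meier", "display_name": "Anna Meier"}
masked = anonymize_person(row, {"Anna": "Emma"}, {"Meier": "Jansen"})
```

Deriving the display name from the already anonymized columns keeps it consistent with them by construction.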

4. Building Replacement Mappings

If identical replacements are needed frequently, a structured mapping list with separate old and new values is recommended.

Original and replacement data should be stored in two separate lists or tables, linked via a technical key (e.g., Hash value).

Table 4. Original Value List
Hash value	Original value
1834abcde123…	Müller
9f6bc12dccc2…	Schmidt

Table 5. Replacement List
Hash value	Replacement value
1834abcde123…	Bauer
9f6bc12dccc2…	Weber
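One possible way to link the two lists through a technical key is sketched below. The use of SHA-256 via Python's `hashlib` is an assumption; the source does not state which hash function is used, and the helper names are hypothetical.

```python
import hashlib


def hash_key(value):
    """Technical key linking both lists (SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_mapping(pairs):
    """Split (original, replacement) pairs into two lists keyed by hash."""
    originals, replacements = {}, {}
    for original, replacement in pairs:
        key = hash_key(original)
        originals[key] = original
        replacements[key] = replacement
    return originals, replacements


def replace(value, replacements):
    """Look up the replacement using only the hash of the original value."""
    return replacements[hash_key(value)]


originals, replacements = build_mapping([("Müller", "Bauer"), ("Schmidt", "Weber")])
```

With this split, the replacement list can be used on its own during masking, because `replace` needs only the hash of the incoming value.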

5. Replacing Data vs. Generating New Values

Before anonymization, it should be considered whether a value needs to be replaced consistently across all instances or if new (random) values should be generated for each occurrence.

Consistent Anonymization: A given original value receives the same replacement system-wide.
Random Generation: Each value is anonymized independently of other entries.

Example of Consistent Replacement (same name each time)
Original	Replaced by
Müller	Bauer
Müller	Bauer

Table 6. Example of Random Replacement (different name each time)
Original	Replaced by
Müller	Bauer
Müller	Weber
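The difference between the two modes can be illustrated with a short Python sketch. The class `ConsistentReplacer` and the function `random_replace` are hypothetical names for this example; the replacement pool is taken from the tables above.

```python
import random


class ConsistentReplacer:
    """Assigns each original value one replacement and reuses it afterwards."""

    def __init__(self, pool, rng=None):
        self._pool = list(pool)
        self._rng = rng if rng is not None else random.Random()
        self._assigned = {}

    def replace(self, original):
        # Draw a replacement only the first time a value is seen.
        if original not in self._assigned:
            self._assigned[original] = self._rng.choice(self._pool)
        return self._assigned[original]


def random_replace(original, pool, rng=None):
    """Draw a fresh replacement on every call, ignoring earlier results."""
    rng = rng if rng is not None else random.Random()
    return rng.choice(list(pool))
```

The consistent variant has to keep state (the assignments made so far), whereas random generation is stateless. That state is the main cost of consistent anonymization.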

Error Cases (Default Values)

If a value is missing for a replacement or lookup, a predefined default value will be used.

Table 7. Example of Default Values
Original country / gender	Replaced name
Norway / male	NN
Germany / non-binary	Alex
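A fallback chain for such default values might look like the following sketch. The per-category defaults are taken from Table 7; the global fallback value and the function name `resolve_name` are assumptions for illustration.

```python
# Defaults per (country, gender) category, as in Table 7.
CATEGORY_DEFAULTS = {
    ("Norway", "male"): "NN",
    ("Germany", "non-binary"): "Alex",
}
# Assumption: used when neither a lookup entry nor a category default exists.
GLOBAL_DEFAULT = "NN"


def resolve_name(lookup, country, gender):
    """Return the lookup value, else the category default, else the global one."""
    key = (country, gender)
    if key in lookup:
        return lookup[key]
    return CATEGORY_DEFAULTS.get(key, GLOBAL_DEFAULT)
```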