XDM Data Orchestration: Anonymization
This guide outlines best practices for GDPR- and BDSG-compliant modification and anonymization of data fields within XDM. It includes practical recommendations for building lookup tables, preparing data processing workflows, and handling error scenarios effectively.
Recommendations for Planning and Implementation Decisions
From a data protection perspective, it is generally recommended to anonymize all fields containing personal data when creating test datasets. Ideally, such data should not be included in test environments in order to meet data minimization and privacy requirements.
As a guideline, define only as many content rules for anonymization as are needed for meaningful test operation. In particular, use masking carefully and apply as few constraints as possible. The aim is to provide effective data protection while still satisfying the application's rules and keeping the data recognizable enough to work with.
Overly granular masking that tries to cover as many special cases as possible can result in a masking project whose complexity and effort grow disproportionately to the quality of the masking. Edge cases can be handled separately through targeted test datasets, without being embedded in the full anonymization rule set.
Step 1: Defining the Anonymization Strategy by Attribute Class
The first step involves working with attribute classes — for example, name, street, city, case number, etc. For each attribute class, a fundamental decision should be made regarding the target values to be used during anonymization. The chosen target model should strike a balance between the desired level of complexity and the needs of the test data. It is important to retain only as many real-world cases and data characteristics from production as are necessary for meaningful testing.
The more constraints defined per attribute class (e.g., gender-preserving or language-consistent name replacement, rules on the number of name components), the more complex the implementation tends to become. This applies both to general development and to handling individual attribute instances — especially when contextual information required for decision-making may be missing.
Decisions made during this step are global in nature and have a significant impact on the overall effort required for the anonymization project. It is therefore recommended to make these fundamental decisions as early as possible in the project lifecycle. However, some rule implications may only become fully apparent later — for example, when additional attribute instances are identified as the project progresses.
Step 2: Implementation at the Level of Individual Attribute Instances
In practical implementation, it is advisable to apply the following decision logic for each attribute instance:
- For attributes that appear centrally in the test data (e.g., in identifying interfaces and outputs), the full anonymization rule set should be applied.
- For instances directly linked to central attributes (such as subviews or detail views), values should be changed consistently to maintain a coherent and traceable data basis.
- For peripheral instances (e.g., automatically generated documents or repeated entries in summaries), the effort required for consistent replacement should be assessed on a case-by-case basis. If the contextual information needed for consistent masking is unavailable or difficult to obtain, it may be appropriate to either delete the affected values (e.g., replace with an empty string) or substitute them with neutral placeholders (e.g., XXX).
This approach enables a fast and practical initial masking, so that the data can be put to use in the test environment early. If later testing reveals that more consistent masking is necessary in specific areas, it is recommended to refine the masking logic accordingly at those points.
This iterative method typically leads to a reliable and GDPR-compliant state of test data relatively quickly, and allows for ongoing improvement step by step. It is usually more effective than relying on non-anonymized data during development or forgoing test cases altogether.
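The decision logic from Step 2 can be sketched in a few lines of plain Python. This is only an illustration and not XDM configuration or an XDM API: the names `InstanceKind`, `mask_instance`, and `consistent_lookup` are hypothetical, and the dictionary stands in for whatever full rule set a project defines.

```python
from enum import Enum


class InstanceKind(Enum):
    CENTRAL = "central"        # appears in identifying interfaces and outputs
    LINKED = "linked"          # subview or detail view of a central attribute
    PERIPHERAL = "peripheral"  # generated documents, repeated summary entries


def mask_instance(kind, original, consistent_lookup, placeholder="XXX"):
    """Return the masked value for one attribute instance.

    consistent_lookup maps original values to their agreed replacements
    (an assumption standing in for the project's full rule set).
    """
    if kind in (InstanceKind.CENTRAL, InstanceKind.LINKED):
        # Central and directly linked instances must stay consistent,
        # so a missing mapping raises KeyError instead of being hidden.
        return consistent_lookup[original]
    # Peripheral: reuse the consistent value when it is available,
    # otherwise fall back to a neutral placeholder ("" would delete it).
    return consistent_lookup.get(original, placeholder)


lookup = {"Müller": "Bauer"}
print(mask_instance(InstanceKind.CENTRAL, "Müller", lookup))      # Bauer
print(mask_instance(InstanceKind.PERIPHERAL, "Schmidt", lookup))  # XXX
```

Starting with the placeholder branch for most peripheral instances and moving individual instances to the consistent branch later matches the iterative refinement described above.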
1. Building Lookup Tables with XDM
To anonymize sensitive data such as names or addresses, lookup tables are used. This allows real values to be replaced with meaningful but anonymized substitutes.
Lookup tables should be structured based on relevant criteria (e.g., country, gender). Each category should contain unique, sequentially numbered replacement values.
Country | Gender | Number | First name
Germany | male | 1 | Jonas
Germany | male | 2 | Leon
Germany | female | 1 | Anna
Germany | female | 2 | Marie
Netherlands | male | 1 | Noah
Netherlands | female | 1 | Emma
Error case: If the lookup table contains no entry for a requested combination (for example, Germany and non-binary), a default value such as Alex is used.
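A lookup with a default can be sketched as follows. The table contents come from the example above; the function names and the way a number is chosen (a modulo on a hypothetical `record_number`) are assumptions for illustration and do not describe how XDM selects entries.

```python
# Lookup table keyed by (country, gender, number), mirroring the example table.
FIRST_NAMES = {
    ("Germany", "male", 1): "Jonas",
    ("Germany", "male", 2): "Leon",
    ("Germany", "female", 1): "Anna",
    ("Germany", "female", 2): "Marie",
    ("Netherlands", "male", 1): "Noah",
    ("Netherlands", "female", 1): "Emma",
}
DEFAULT_FIRST_NAME = "Alex"  # used when a category has no entries


def category_size(country, gender):
    """Count the replacement values available for one category."""
    return sum(1 for (c, g, _n) in FIRST_NAMES if (c, g) == (country, gender))


def replacement_first_name(country, gender, record_number):
    """Pick a replacement name; fall back to the default for empty categories."""
    size = category_size(country, gender)
    if size == 0:
        return DEFAULT_FIRST_NAME
    # Assumption: fold any positive record number onto the range 1..size.
    number = (record_number - 1) % size + 1
    return FIRST_NAMES[(country, gender, number)]


print(replacement_first_name("Germany", "male", 2))        # Leon
print(replacement_first_name("Germany", "non-binary", 1))  # Alex
```

Because each category is numbered sequentially from 1, the size of a category is enough to address every entry in it without gaps.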
2. Collecting Data Before Copying
To efficiently use various pieces of information for later anonymization, required attributes should first be collected in an intermediate table.
The data should be assembled using a unique key (e.g., Personal ID). The intermediate table holds all relevant information.
Personal ID | Gender | Country of Birth
4711 | male | Germany
2455 | female | Netherlands
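The collection step can be pictured as a merge of several sources on the shared key. The sketch below uses plain Python dictionaries; the two source lists (`persons`, `births`) and the column names are hypothetical and only reproduce the example rows.

```python
def collect_attributes(*sources):
    """Merge rows from several sources into one table keyed by Personal ID."""
    intermediate = {}
    for rows in sources:
        for row in rows:
            entry = intermediate.setdefault(row["personal_id"], {})
            for column, value in row.items():
                if column != "personal_id":
                    entry[column] = value
    return intermediate


# Hypothetical source tables that each hold part of the needed context.
persons = [
    {"personal_id": 4711, "gender": "male"},
    {"personal_id": 2455, "gender": "female"},
]
births = [
    {"personal_id": 4711, "country_of_birth": "Germany"},
    {"personal_id": 2455, "country_of_birth": "Netherlands"},
]

intermediate = collect_attributes(persons, births)
print(intermediate[4711])  # {'gender': 'male', 'country_of_birth': 'Germany'}
```

With the context gathered once per key, later anonymization rules can read gender and country from a single place instead of querying each source again.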
3. Anonymizing Data in a Dependency-Aware Sequence
Some columns depend on each other for anonymization. Always anonymize the base column first; dependent fields can then be populated from the anonymized base value.
First name (original) | First name (anonymized) | Display name (anonymized)
Anna | Emma | Emma Jansen
Jonas | Leon | Leon Musterfrau
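The ordering can be made explicit in code: base columns are replaced first, and the derived column is rebuilt from the already anonymized values. In this sketch the function name, the column names, and the original surnames (Schmidt, Meier) are invented for illustration; only the anonymized values are taken from the example.

```python
def anonymize_record(record, first_name_map, last_name_map):
    """Anonymize base columns first, then rebuild the derived display name."""
    masked = dict(record)
    # 1. Base columns.
    masked["first_name"] = first_name_map[record["first_name"]]
    masked["last_name"] = last_name_map[record["last_name"]]
    # 2. Derived column: built from the anonymized base values,
    #    never from the original display name.
    masked["display_name"] = f'{masked["first_name"]} {masked["last_name"]}'
    return masked


first_name_map = {"Anna": "Emma", "Jonas": "Leon"}
last_name_map = {"Schmidt": "Jansen", "Meier": "Musterfrau"}  # originals are hypothetical

record = {"first_name": "Anna", "last_name": "Schmidt", "display_name": "Anna Schmidt"}
print(anonymize_record(record, first_name_map, last_name_map)["display_name"])  # Emma Jansen
```

Masking the display name independently would risk a record whose first name and display name no longer agree, which is the inconsistency this sequence avoids.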
4. Building Replacement Mappings
If identical replacements are needed frequently, a structured mapping list with separate old and new values is recommended.
Original and replacement data should be stored in two separate lists or tables, linked via a technical key (e.g., a hash value).
Hash value | Original value
1834abcde123… | Müller
9f6bc12dccc2… | Schmidt

Hash value | Replacement value
1834abcde123… | Bauer
9f6bc12dccc2… | Weber
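A minimal sketch of the two linked tables follows. The source does not say which hash function produces the key, so SHA-256 from Python's `hashlib` is an assumption here, as are the helper names `technical_key`, `build_mappings`, and `replace_value`.

```python
import hashlib


def technical_key(value):
    """Derive the linking key from an original value (SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_mappings(pairs):
    """Build two separate tables that share only the technical key."""
    originals, replacements = {}, {}
    for original, replacement in pairs:
        key = technical_key(original)
        originals[key] = original
        replacements[key] = replacement
    return originals, replacements


def replace_value(value, replacements):
    """Look up the replacement without touching the table of originals."""
    return replacements[technical_key(value)]


originals, replacements = build_mappings([("Müller", "Bauer"), ("Schmidt", "Weber")])
print(replace_value("Müller", replacements))  # Bauer
```

The replacement table holds only keys and substitute values, so it can be used for masking while the table of original values is stored separately.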
5. Replacing Data vs. Generating New Values
Before anonymizing, decide whether a value must be replaced consistently across all instances or whether a new (random) value should be generated for each occurrence.
Consistent anonymization: A given original value receives the same replacement system-wide.
Random generation: Each value is anonymized independently of other entries.
Example (consistent anonymization):
Original | Replaced by
Müller | Bauer
Müller | Bauer

Example (random generation):
Original | Replaced by
Müller | Bauer
Müller | Weber
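The difference between the two modes can be sketched as follows. `ConsistentReplacer` and `random_replacement` are hypothetical names, the pool reuses the surnames from the example, and assigning pool values in order is an assumption made to keep the sketch short.

```python
import random

SURNAME_POOL = ["Bauer", "Weber"]


class ConsistentReplacer:
    """Remember each assignment so an original always maps to the same value."""

    def __init__(self, pool):
        self._pool = list(pool)
        self._assigned = {}

    def replace(self, original):
        if original not in self._assigned:
            # Assumption: hand out pool values in order, wrapping around
            # (and therefore repeating values) once the pool is exhausted.
            index = len(self._assigned) % len(self._pool)
            self._assigned[original] = self._pool[index]
        return self._assigned[original]


def random_replacement(pool, rng):
    """Draw a fresh value per occurrence, ignoring earlier replacements."""
    return rng.choice(pool)


consistent = ConsistentReplacer(SURNAME_POOL)
print(consistent.replace("Müller"), consistent.replace("Müller"))  # Bauer Bauer

rng = random.Random()
print(random_replacement(SURNAME_POOL, rng), random_replacement(SURNAME_POOL, rng))
```

The consistent variant has to keep state (the assignment table), whereas random generation needs none; that state is the main extra cost of consistent anonymization.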