XDM Data Orchestration: Anonymization

This guide outlines best practices for GDPR- and BDSG-compliant modification and anonymization of data fields within XDM. It includes practical recommendations for building lookup tables, preparing data processing workflows, and handling error scenarios effectively.

Recommendations for Planning and Implementation Decisions

From a data protection perspective, it is generally recommended to anonymize all fields containing personal data when creating test datasets. Ideally, real personal data should not be present in test environments at all, in line with data minimization and privacy requirements.

As a guideline, define only as many content rules for anonymization as are needed for meaningful test operation. In particular, apply masking with care and with as few constraints as possible. The aim is to provide effective data protection while still satisfying the application's rules and keeping the data recognizable enough for testing.

Overly granular masking that tries to cover as many special cases as possible can result in a masking project whose complexity and effort grow out of proportion to the quality gained. Edge cases can instead be handled separately through targeted test datasets, without being embedded in the full anonymization rule set.

Step 1: Defining the Anonymization Strategy by Attribute Class

The first step involves working with attribute classes — for example, name, street, city, case number, etc. For each attribute class, a fundamental decision should be made regarding the target values to be used during anonymization. The chosen target model should strike a balance between the desired level of complexity and the needs of the test data. It is important to retain only as many real-world cases and data characteristics from production as are necessary for meaningful testing.

The more constraints defined per attribute class (e.g., gender-preserving or language-consistent name replacement, rules on the number of name components), the more complex the implementation tends to become. This applies both to general development and to handling individual attribute instances — especially when contextual information required for decision-making may be missing.

Decisions made during this step are global in nature and have a significant impact on the overall effort required for the anonymization project. It is therefore recommended to make these fundamental decisions as early as possible in the project lifecycle. However, some rule implications may only become fully apparent later — for example, when additional attribute instances are identified as the project progresses.

Step 2: Implementation at the Level of Individual Attribute Instances

In practical implementation, it is advisable to apply the following decision logic for each attribute instance:

  • For attributes that appear centrally in the test data (e.g., in identifying interfaces and outputs), the full anonymization rule set should be applied.

  • For instances directly linked to central attributes (such as subviews or detail views), values should be changed consistently to maintain a coherent and traceable data basis.

  • For peripheral instances (e.g., automatically generated documents or repeated entries in summaries), the effort required for consistent replacement should be assessed on a case-by-case basis. If the contextual information needed for consistent masking is unavailable or difficult to obtain, it may be appropriate to either delete the affected values (e.g., replace with an empty string) or substitute them with neutral placeholders (e.g., XXX).
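The decision logic above can be sketched as a small function. This is only an illustration: the names InstanceKind and choose_masking_action and the returned action labels are hypothetical and not part of XDM, and the case-by-case assessment for peripheral instances is reduced to a single flag as a simplifying assumption.

```python
from enum import Enum


class InstanceKind(Enum):
    CENTRAL = "central"        # appears in identifying interfaces and outputs
    LINKED = "linked"          # subview or detail view of a central attribute
    PERIPHERAL = "peripheral"  # generated documents, repeated summary entries


def choose_masking_action(kind: InstanceKind, context_available: bool) -> str:
    """Return a (hypothetical) action label for one attribute instance."""
    if kind is InstanceKind.CENTRAL:
        return "apply_full_rule_set"
    if kind is InstanceKind.LINKED:
        return "replace_consistently"
    # Peripheral instance: assumption that consistent replacement is only
    # worthwhile when the required context is readily available.
    if context_available:
        return "replace_consistently"
    return "blank_or_placeholder"  # e.g. empty string or "XXX"
```

In a real project the peripheral branch would usually weigh effort against benefit per instance rather than rely on one boolean.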

This approach enables a fast and practical initial masking of the data so that it can be put to use in the test environment. If later testing reveals that more consistent masking is necessary in specific areas, it is recommended to refine the masking logic accordingly at those points.

This iterative method typically leads to a reliable and GDPR-compliant state of test data relatively quickly, and allows for ongoing improvement step by step. It is usually more effective than relying on non-anonymized data during development or forgoing test cases altogether.

1. Building Lookup Tables with XDM

To anonymize sensitive data such as names or addresses, lookup tables are used. This allows real values to be replaced with meaningful but anonymized substitutes.

Lookup tables should be structured based on relevant criteria (e.g., country, gender). Each category should contain unique, sequentially numbered replacement values.

Table 1. Example Lookup Table for First Names

Country     | Gender | Number | First name
------------+--------+--------+-----------
Germany     | male   | 1      | Jonas
Germany     | male   | 2      | Leon
Germany     | female | 1      | Anna
Germany     | female | 2      | Marie
Netherlands | male   | 1      | Noah
Netherlands | female | 1      | Emma

Error case: If the lookup table contains no name for a combination such as Germany / non-binary, a default value such as Alex is used.
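A lookup of this kind might look as follows in plain Python. This is a minimal sketch and does not use any XDM API: the dictionary layout, the function name lookup_first_name, and the wrap-around for numbers beyond the end of a category are assumptions made for illustration.

```python
# Lookup table keyed by (country, gender); the list position is the
# sequential number minus one, mirroring Table 1.
FIRST_NAMES = {
    ("Germany", "male"): ["Jonas", "Leon"],
    ("Germany", "female"): ["Anna", "Marie"],
    ("Netherlands", "male"): ["Noah"],
    ("Netherlands", "female"): ["Emma"],
}

DEFAULT_FIRST_NAME = "Alex"  # default used when a category has no entries


def lookup_first_name(country: str, gender: str, number: int) -> str:
    """Return the replacement with the given 1-based number, or the default."""
    names = FIRST_NAMES.get((country, gender))
    if not names:
        return DEFAULT_FIRST_NAME
    # Assumption: numbers larger than the category size wrap around.
    return names[(number - 1) % len(names)]


print(lookup_first_name("Germany", "male", 2))        # Leon
print(lookup_first_name("Germany", "non-binary", 1))  # Alex
```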

2. Collecting Data Before Copying

So that the various pieces of information needed for later anonymization are readily available, the required attributes should first be collected in an intermediate table.

The data should be assembled using a unique key (e.g., Personal ID). The intermediate table holds all relevant information.

Table 2. Example Intermediate Table

Personal ID | Gender | Country of Birth
------------+--------+-----------------
4711        | male   | Germany
2455        | female | Netherlands
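The collection step can be illustrated with a short sketch that merges attributes from two sources into one structure keyed by Personal ID. The two input lists and the function name build_intermediate_table are hypothetical; in practice the attributes would come from the source tables of the application.

```python
def build_intermediate_table(gender_rows, birth_rows):
    """Merge per-person attributes from two sources into one dict keyed by ID."""
    intermediate = {}
    for person_id, gender in gender_rows:
        intermediate.setdefault(person_id, {})["gender"] = gender
    for person_id, country in birth_rows:
        intermediate.setdefault(person_id, {})["country_of_birth"] = country
    return intermediate


# Hypothetical source data matching Table 2.
gender_rows = [(4711, "male"), (2455, "female")]
birth_rows = [(4711, "Germany"), (2455, "Netherlands")]

intermediate = build_intermediate_table(gender_rows, birth_rows)
print(intermediate[4711])  # {'gender': 'male', 'country_of_birth': 'Germany'}
```

Keying everything by the same ID means later anonymization steps can read all attributes of a person in one place instead of querying each source again.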

3. Anonymizing Data in a Dependency-Aware Sequence

Some columns depend on each other for anonymization. A display name, for example, is composed of the first name and must therefore reflect the anonymized first name rather than the original one.

Always anonymize the base column first, then populate any derived fields accordingly.

Table 3. Example of Dependent Anonymization

First name (original) | First name (anonymized) | Display name (anonymized)
----------------------+-------------------------+--------------------------
Anna                  | Emma                    | Emma Jansen
Jonas                 | Leon                    | Leon Musterfrau
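The ordering can be sketched as follows: the base columns are replaced first, and the dependent display name is then rebuilt from the already anonymized values. The column names, the mapping dictionaries, the original last name Müller, and the fallback "NN" are assumptions for illustration only.

```python
def anonymize_person(row, first_name_map, last_name_map, default="NN"):
    """Anonymize base columns first, then rebuild the dependent display name."""
    masked = dict(row)
    # Step 1: base columns.
    masked["first_name"] = first_name_map.get(row["first_name"], default)
    masked["last_name"] = last_name_map.get(row["last_name"], default)
    # Step 2: derived column, built only from already anonymized values.
    masked["display_name"] = f'{masked["first_name"]} {masked["last_name"]}'
    return masked


row = {"first_name": "Anna", "last_name": "Müller", "display_name": "Anna Müller"}
masked = anonymize_person(row, {"Anna": "Emma"}, {"Müller": "Jansen"})
print(masked["display_name"])  # Emma Jansen
```

Rebuilding the display name instead of masking it independently keeps the two columns consistent with each other.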

4. Building Replacement Mappings

If identical replacements are needed frequently, a structured mapping list with separate old and new values is recommended.

Original and replacement data should be stored in two separate lists or tables, linked via a technical key (e.g., a hash value).

Table 4. Original Value List

Hash value    | Original value
--------------+---------------
1834abcde123… | Müller
9f6bc12dccc2… | Schmidt

Table 5. Replacement List

Hash value    | Replacement value
--------------+------------------
1834abcde123… | Bauer
9f6bc12dccc2… | Weber
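A minimal sketch of two lists linked by a hash key is shown below. The guide does not say which hash function is used, so SHA-256 is an assumption here, and the resulting keys will not match the abbreviated hash values shown in Tables 4 and 5. The function names and the fallback "NN" are likewise hypothetical.

```python
import hashlib


def value_key(value: str) -> str:
    """Technical key for a value (assumption: SHA-256 of the UTF-8 bytes)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_mappings(pairs):
    """Return two separate dicts, originals and replacements, sharing one key."""
    originals, replacements = {}, {}
    for original, replacement in pairs:
        key = value_key(original)
        originals[key] = original
        replacements[key] = replacement
    return originals, replacements


def replace_value(value, replacements, default="NN"):
    """Look up the replacement via the hash key; fall back to a default."""
    return replacements.get(value_key(value), default)


originals, replacements = build_mappings([("Müller", "Bauer"), ("Schmidt", "Weber")])
print(replace_value("Müller", replacements))  # Bauer
```

Because the replacement list holds only hash keys and new values, it can be used during masking without exposing the original value list.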

5. Replacing Data vs. Generating New Values

Before anonymization, it should be considered whether a value needs to be replaced consistently across all instances or if new (random) values should be generated for each occurrence.

  • Consistent anonymization: A given original value receives the same replacement system-wide.

  • Random generation: Each value is anonymized independently of other entries.

Example of Consistent Replacement (same name each time)

Original | Replaced by
---------+------------
Müller   | Bauer
Müller   | Bauer

Table 6. Example of Random Replacement (different name each time)

Original | Replaced by
---------+------------
Müller   | Bauer
Müller   | Weber
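The difference between the two modes can be sketched in a few lines. This is an illustration under stated assumptions, not XDM functionality: ConsistentReplacer and random_replace are hypothetical names, the replacement pool is assumed to be non-empty, and the consistent variant simply hands out pool entries in order and wraps around when the pool is exhausted.

```python
import random


class ConsistentReplacer:
    """Gives each distinct original value one replacement and reuses it."""

    def __init__(self, pool):
        self._pool = list(pool)  # assumption: pool is not empty
        self._assigned = {}

    def replace(self, original):
        if original not in self._assigned:
            # Next unused pool entry; wraps around once the pool is exhausted.
            index = len(self._assigned) % len(self._pool)
            self._assigned[original] = self._pool[index]
        return self._assigned[original]


def random_replace(pool, rng):
    """Draws a replacement independently for every occurrence."""
    return rng.choice(pool)


pool = ["Bauer", "Weber"]

consistent = ConsistentReplacer(pool)
print(consistent.replace("Müller"), consistent.replace("Müller"))  # Bauer Bauer

rng = random.Random()
print(random_replace(pool, rng), random_replace(pool, rng))  # may differ per call
```

Consistent replacement keeps relationships between records intact, whereas random generation gives up that consistency because every occurrence is drawn independently.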

Error Cases (Default Values)

If a value is missing for a replacement or lookup, a predefined default value will be used.

Table 7. Example of Default Values

Original country / gender | Replaced name
--------------------------+--------------
Norway / male             | NN
Germany / non-binary      | Alex