Creating a Mapping Table for Data Shuffling

Overview

This guide describes how to create a mapping table for data shuffling in XDM using Groovy scripting. The primary goal is to permute, for example, last names within a table so that data masking is achieved with valid values from the original dataset. Identical source values are systematically mapped to identical target values, ensuring consistency when required.

Prerequisites

In this guide, we want to shuffle the employees' last names as they are used in the corresponding table in the tutorial’s sample environment. The created key/value pairs are stored in a new mapping table.

The prerequisites here are:

It is possible to use an existing database table. In this case, the prerequisites are:

Connection to the database, where the source table is located. In our situation, a connection to the sample environment is necessary.
Empty H2 Mapping table in XDM as target, not necessary in a mapping table container.
Read access to the database table and write access to the mapping table.

It is also possible to store the values to be shuffled in a mapping table and then use this as the basis for shuffling.

Multiple Steps for shuffling

Prepare the Workspace

Import the required package and initialize the helper classes:

import de.ubs.xdm3.batch.modification.MappingTableUtils

def utilPOOL = new MappingTableUtils(myConnection)
def poolTable = 'TABLE_2'

Read All Relevant Values from the Source Table

Retrieve the values to be shuffled (e.g., last names) from the employees table:

def stmt = myConnection.createStatement()
def rs = stmt.executeQuery("SELECT DISTINCT last_name FROM production.employees")
def valueList = []
while (rs.next()) {
    valueList << rs.getString(1)
    }
rs.close()
stmt.close()

The DISTINCT clause in the statement serves here to specify each name exactly once so that a unique assignment is created. If you want to take the frequency of the names into account, you can extract the list of occurring IDs in addition to the list of occurring names. In this case, the statements look like this:

def stmt = myConnection.createStatement()
def rs1 = stmt.executeQuery("SELECT emp_no FROM production.employees")
def valueList1 = []
while (rs1.next()) {
    valueList1 << rs1.getString(1)
    }
rs1.close()
rs2 = stmt.executeQuery("SELECT last_name FROM production.employees")
def valueList2 = []
while (rs2.next()) {
    valueList2 << rs2.getString(1)
    }
rs2.close()
stmt.close()

Since this process works through the entire list in memory, it has limitations for the number of keys to shuffle. It is therefore not suitable for tables with more then 100 million entries. This value can be different for any environment due to available memory.

Create and Shuffle Two Lists Independently

To ensure random yet systematic assignment, create two separate copies of the value list and shuffle both:

def list1 = new ArrayList(valueList)
def list2 = new ArrayList(valueList)
Collections.shuffle(list1)
Collections.shuffle(list2)

If two lists are selected from database, use valueList1 for list1 and valueList2 for list2.

Generate and Populate the Mapping Table

Pair the entries from both lists and store pairs in the mapping table:

def n = Math.min(list1.size(), list2.size())
for (int i = 0; i < n; i++) {
     def key = [list1[i]] as String[]
     def data = [list2[i]] as String[]
     utilPOOL.merge(poolTable, key, data)
     println "Pair: ${key[0]} -> ${data[0]}"
 }
 utilPOOL.close()

All four steps have to be executed in one hook.

Benefits

Data Privacy: Shuffling masks personal data while preserving valid values for internal testing and analysis.
Systematic Mapping: Identical source values are always mapped to the same target value, ensuring consistency across tables and processes.
Flexibility: The procedure can be easily applied to any attribute or table.
Automation: The process can be automated and is easily repeatable.

Conclusion

This approach enables you to efficiently create a mapping table for data shuffling directly via XDM scripting in Groovy. It ensures consistency, supports data privacy, and is suitable for various use cases such as data masking and anonymized test data management.