Data Modification

Data Modification enables you to change data during task execution. Data modification during a copy is crucial for masking data. In the following articles, there will be a detailed insight into the technical backgrounds of data modification.

Overview

Masking is set up in multiple steps.

First there are the modification methods. Those methods define how a data value or data row should be changed for specific data. Those changes can happen for privacy reasons as well in order to produce some special test value. The chapter Modification Methods contains more information. The script in a modification method defines how a single value or entire row of table data should be modified.

Groovy:

def apply() {
    data[columnIndex] = 'New Value'
}

This simple example script sets the constant string 'New value' for all rows on the column the modification method is applied to by a modification rule. The following chapters will explain how to configure more complex scenarios.

The next step is the definition of modification rules. Rules define which columns of a table are changed during task execution. The chapter Modification Rules contains more information.

Modification sets are the last step. These sets allow collecting rules by application tables and their purpose. These sets can be on different objects throughout the installation, a point that is addressed in the chapter Apply Modification Sets to Objects.

XDM is shipped with a lot of useful default modification methods. Users can find a description of available modification methods and their parameters in the article modification methods in the chapter Useful Examples.

Modification Methods

Modification methods are code snippets (written in Groovy or JavaScript) that are used to modify the content of a row as it is processed by XDM. This chapter describes how to create modification methods, how they are used, the definition of parameters, and possible side effects to watch out for.

Introduction

Modification methods are used to change values of cells during the copy process. This can be used, for example, to mask personally identifiable data before it is written to an output file or target table, to generate new values for columns that only exist in the target table but not in the corresponding source table, or for changing dates to simulate a time shift.

It is possible to add, edit, and delete modification methods in order to customize how XDM masks sensitive data, and how data for extra target columns is to be generated during a copy process.

XDM ships with a set of default modification methods that can be used directly in a copy process. In addition, these pre-defined modification methods demonstrate how modification methods are written and how to access the data of a table row within the method.

A modification method is linked to a specific table name with a modification rule. The rule specifies patterns for schema name, table name and column name to define the possible tables on which a method should be used. In addition, a modification rule allows the specification of parameters and optional default values for a modification method. Learn more about modification rules in the linked chapter.

It is not possible to use more than one modification method in a single modification rule. Instead, if XDM is to modify data with more than one modification method, two or more modification rules should be specified.

A detailed description on how to create custom modification methods, use them in modification rules and apply these to a task that copies data can be found in chapter Modification Sets and Modification Methods in the tutorial.

Overview of Modification Method Processing

If XDM processes data from a table where one or more columns have associated modification rules, the data of the row is passed to the corresponding modification method(s). A modification method operates on the row and changes the content corresponding to the selected columns. After all modification methods have been executed, processing of the modified row continues normally.

Modification methods can also operate on columns that do not exist in the source environment. To do so, set the Scope option of the modification rule to Target. A modification method that is associated with one or more columns that only exist in the target environment can generate data based on the values in other columns of the selected row. An example of this is provided in the data type property description of the reference documentation for modification method parameters.

A modification method can inadvertently change data such that it is incompatible with the data type or length of the target column. XDM cannot check whether a modification method yields incompatible values before the task is executed, because the result may depend on the actual data stored in the source table. When a modification method is applied to a column, the resulting values must be compatible with the target column in both size and type in order to avoid processing errors.

Using modification methods may violate constraints, such as referential integrity constraints, unique constraints, or check constraints. For table copying task templates, depending on the task template settings, XDM may or may not prevent the target database system from checking whether all constraints are satisfied at the end of the copy process. When a modification method is to be applied to a column, it is up to the user to ensure that the resulting values will not violate any constraints.

General Settings

The General Settings panel defines the primary modification method settings. The properties specified on this panel are described in the modification method reference documentation. The Name, Category, Modification type, and Modification Code properties are mandatory.

Parameters

Parameters are a special sub-category which have their own properties. These are described in detail in the modification method parameter reference documentation.

A modification method need not have any parameters. However, it is often useful to define parameters for a modification method, because the parameter values can be set explicitly for each modification rule in which the method is used.

Modification parameters used in modification methods are simply variables in the Groovy/JavaScript code. The code, therefore, determines the number and type of parameters for which values must be supplied. Parameter values are supplied by the modification rule which uses the method. The modification rule can, however, define more parameters than are required by the method. This can be useful when trying out various different methods for a modification rule.

Defining Method Parameters

The following code shows how to define a modification method with method parameters. They can be implemented using Groovy or JavaScript.

Groovy:

def apply() {
    data[columnIndex] = newValue
}

JavaScript:

function apply() {
    data[columnIndex] = newValue;
}

Note that newValue corresponds to the variable name for the parameter and supplies the value for the column. A modification rule using this method must provide a value for the newValue parameter.

Using the Connection Data Type

The Connection data type would typically be used with the method ctx.fetchSingleValueFromConnection(…) in the modification method code. This allows SQL statements to be executed against the specified connection. The connection parameter’s variable name takes the place of the first parameter, which is a string. Note that the variable name must be entered as a string value. For instance, see the example in the data type property reference documentation.

The connection assigned to the variable is specified in the modification rule which uses the method. It is selected from a list of connections. The list will contain those connections for which the user has read access and source usage permission.

The SQL statement to be executed is entered in the second parameter, which is also a string.

Specifying Parameter Values in Modification Rules

When the method in the previous example is used in a modification rule, the modified column value would be the default value of the variable specified in the method definition, provided a default has been defined. The default value can be overridden in the rule by specifying a different value for the variable in the parameter list. Any variable used in the chosen method must have a value assigned to it.

The value for a parameter in a modification rule can be supplied in two ways:

By Constant: The value will be supplied to the modification method code as it is defined in the rule.
By Table Column: The value supplied to the modification method code will be read from the named column, as defined by the rule, in the currently processed table. Of course, for every table matching the modification rule, the column named by the parameter must exist in the table.
By Custom Parameter: The current value of the custom parameter is passed to the modification method.

Using Custom Parameters in Modification Rules

Custom parameters are defined on the modification set, which can provide a default value for the parameter. When the custom parameter is also defined on the task template or task, which uses the modification set, the value defined there will be used instead of the default value defined on the modification set. If data-type conversion is enabled, the value of the custom parameter will be converted to its data type. This enables the modification method to directly use a typed parameter value. If the conversion is disabled, the value will be passed as a string.

Column based methods and row based methods

When creating a new modification method, the method is defined as either column based or row based. For column based methods, a column pattern must be specified in the modification rule. The specified modification method is applied to every column which matches the specified column pattern in every table which matches the specified table pattern.

For instance, consider the situation where the table TBEMP has columns FIRSTNMAME and LASTNAME and the table TBDEPT has the column DEPTNAME. Suppose also that a modification method called "Scramble Strings" is defined in order to anonymize string data. If both tables are selected in a task, then by using a single column based rule with name pattern "TB%" and column pattern "%NAME", the method "Scramble Strings" can be applied to the columns TBEMP.FIRSTNAME, TBEMP.LASTNAME, and TBDEPT.DEPTNAME.

For row based methods it is possible to modify more than one column. Indeed, a row based method can modify any number of columns in a row. A modification rule using such a method need not specify a column pattern. If a column pattern is specified in a row based method, then it will be ignored.

The method code determines what type of modification takes place. If you define a column based method with code designed to modify a row based on column indexes, then a rule using that method will behave as if a row based method were used.

Although XDM will allow the use of row based code in a method defined as column based, you should not do this. When a modification rule specifies a column pattern, only the modified values for those columns will be checked to determine whether they are valid for the target column definitions. Furthermore, conversions which might be necessary for compatibility of JavaScript data types with database types will only be applied to the values for columns that match the column name pattern.

The converse, however, is not the case. If a method is defined as row based, but has code designed for column based processing, then an error will occur, because the standard variables columnName and columnIndex are not available to methods defined as row based.

Implementing Lookup tables: Mapping table containers and other solutions

In many masking and modification situations, you need the possibility to store data to get values from or as a persistent memory to be able to assign calculated or generated values to their initial value. This serves to set up consistent, deterministic masking sequences.

In relation with names, addresses and other values from the real world, a mechanism is helpful that relies on a base of available values and selects a value via a hash procedure either from the initial value or another base value such as a person number or other classification terms from the lookup table.

There are mapping table containers within XDM for such an implementation. These containers offer a persistent memory, an API interface prepared for the masking purpose in hooks and modification methods, ensure the provision of data even in a multi-node execution in the cloud environment and regulate transaction security by means of locks on the respective task executions that use the mapping table containers through the configured hooks and modification methods.

There are pre-configured mapping table containers for XDM that contain lists for names, addresses, bank information and more. See lookup tables chapter.

Mapping table containers can also be created with your own data. A hook using the MappingTableUtils API is generally used for this purpose. This hook can be executed via a workflow. Typically, this table is then rebuilt once before a large masking run or at regular intervals. Within the masking methods, these mapping table containers are only used read-only. This means that there is no lock problem if several tasks with modification methods that access the mapping table container are executed in parallel.

An example is available in the script examples for task stage hooks.

There are situations, where it is necessary to save information from a task run in order to reuse it in other executions of the modification. For example if you generate a new classification term number and want to replace it consistently in all occurrences and also in repeating task executions. Then mapping table containers are not a suitable choice, as the transaction span from the start of execution to the end of execution is too long. The transaction span is necessary to ensure consistent use of the written information. But this blocks any parallel task execution.

Mapping table containers are optimized to be written once and then offer high performance with many read accesses.

If transaction-safe writing dynamically from task runs is also required, there are other variants of the MappingTableUtils, which then write to a specific relational database. There is a variant that uses the XDM administration database as central storage, as well as the option of using any other database in the system landscape.

An example of how to use the MappingTableUtils API for persisting a mapping table inside the XDM administration database, see the paragraph Use stored data from XDM’s Administration Database in the script examples.

The MappingTableUtils API can also be used to manage information within a task execution. For example, to transport information between modification methods with scope source and those with scope target or to take information for an icebox at the time of To Icebox Task that is not in the actual user data required for decisions in the From Icebox Task. For this purpose, the MappingTableUtils class is instantiated with a configuration with a file within the task execution directory.

How to use the API in correspondence with storing the mapping information in a local file, see this scenario in the script examples.

Modification Rules

Modification rules change the content of tables while they are being processed.

The order of the modification rules influences the result of modification. If two or more rules modify the same value, all rules are nevertheless applied in ascending order. If modification rules are defined on multiple modification sets, the rule defined in a set more specific for the task execution will be applied later. For example, a rule in a modification set of the task will be applied after a modification rule defined in a set on the connection.

When using modification for a column, you should ensure that the data can be copied to the target table. For example, when shrinking data using modification, in order to avoid compatibility errors resulting from truncation, ensure that the Toleration Mode is set to Allow truncation.

Examples

Typical scenarios for using modification rules are:

You want to replace personally identifiable data in a column of the source table with random data. Assume you have a column LASTNAME, which contains a person’s last name. To replace the data you can add a modification rule for LASTNAME and associate it with a modification method that generates a random string. A modification method that generates a random string typically takes at least one parameter defining the maximum length of the resulting string. You can set this parameter to the maximum length of the column. The resulting random string will never exceed the set length. After the copy, the target table will contain random strings in the column that contains last names in the source environment.
You want to generate data for a column that only exists in the target environment, such as an e-mail address, based on the corresponding person’s first and last names. You can add a modification rule to the modification set used by the task template that matches the table and column that contains the e-mail address in the target environment by setting the option Scope to Target. This modification method could compose a seemingly valid e-mail address by combining the column values for the person’s first and last names with the suffix @example.com (or similar).

Modification rules to fill a column only existing in the target

If columns exist in the target environment, which does not exist in the source environment, it can be necessary to fill these columns automatically during task execution. In detail, these columns are NOT NULL but have no default value to fill it automatically.

XDM uses modification rules with scope Target for this purpose. The modification must be set on the column to fill. The values of the default value of the column will be defined by a modification method. Here the user can define its own logic to generate the default.

If the column is unique the method must generate a unique value for each unique row identifier. This can be achieved, for instance, by using a counter as part of the unique row identifier. The method Generator/autoincrement is provided as an example of such a modification method.

Modification Sets

Modification rules are grouped by modification sets. These sets allow collecting rules by application tables and their purpose. So it might be possible to have one set for all the tables storing customer data that contains all the rules to mask the data. Another set is also for the customer tables, but holds the rules to generate new key values for a customer. Other sets also exist for contract data tables, and so on.

Apply Modification Sets to objects

These sets can be utilized in task executions by applying them to one of the used objects in the task.

Modification sets can be applied on:

Connection
- To enforce modification for the purpose of masking during reads from, for example, production, masking sets can be attached to the connection. Any task reading from this environment will be subject to the modification. Through appropriate permission settings, it can also be enforced that this modification cannot be deactivated.
Application Model level
- Modification can be used at the application model level to make functionally relevant adjustments. For example, implementing internal IDs, converting email addresses to test accounts, or reassigning user mappings.
Version level
- When a new version is available in an application model and tables have changed or been added, they may need to be modified differently than in the previous version.
Environment level
- Modification can be attached to the environment when, for example, references to other systems need to be adjusted in the test system (such as URLs to third-party systems associated with the test environment).
Installed application level
- When one or more environments have the same installed application but still need to maintain a dynamic approach to data modification that can not be done on an Environment level or is too high maintenance for any of the other levels
Task template level
- Typically used during the development of a modification set or to attach workarounds to specific executions.

Using Modification Sets in Task Templates

Modification can be used in nearly all task templates, where data is copied. In detail, it is possible to use modification in these task templates:

Native table copy task templates,
Compatibility table copy task templates,
Compatibility table to icebox task template,
Compatibility table from icebox task template,
Row level processing task templates,
Row level to icebox task templates, and
Row level from icebox task template.

All other task types do not allow modification.

Limitations of Modification Functions in Native Table Iceboxes

Native Table Iceboxes do not support modification functions. This limitation exists for technical reasons based on how data is extracted and stored in different scenarios.

Data Extraction Methods

To understand, why Native Table Iceboxes do not support modification functions, it is necessary to understand, how XDM will extract and load data in native table copy and native table icebox tasks.

When modifications are enabled for a table, data extraction in Native Table Copy Tasks is handled with a Generic Extract method. In contrast, tables without modifications use the Unload method.

Generic Extract: Extracts and temporarily stores the data in a format that allows modifications.
Unload: Extracts data in the native database format. This format cannot be altered by XDM after extraction.

More details about data extraction can be found in the section Extracting, Data Transport and Loading.

In a Native Table Copy Task, XDM is able to control the format of how data is extracted.

When using a native table copy to icebox task it is not possible to determine which data extract format needs to be used, because the data extract format depends on if the data needs to be modified in the native table copy from icebox task which will be executed later. Therefore, modification cannot be offered within the native table from icebox task. To ensure consistency, modification support is also not available in native table to icebox tasks.

How to modify data in iceboxes

If you require modification functions for tables in an Icebox, use compatibility table icebox tasks instead of native table icebox tasks.

Alternatively, you can: 1. Restore the data from the Native Icebox. 2. Use a Native Table Copy Task to anonymize or modify the data in place as a separate step.

This process ensures your modification requirements are met while maintaining system integrity.

Modification Ordering

The following chapter describes the order in which modifications are applied in XDM. The order of the modifications has a direct effect on the modified data values.

Example: You want to anonymize the first and last name and the email address. The email address is defined as first name, last name and the domain. First name and last name are columns in this table.

If you modify the names before the e-mail address, the e-mail address already contains the modified names. However, if you modify the email address before the names, the e-mail address still contains the unmodified names.

The order of modifications is applied hierarchically. The hierarchy is defined as follows:

Scope
- Type of object
  - Order of the modification sets on the object
    
    Order of the rules in the set with the same scope

Scope

All Modification Rules with Scope Source are executed before those with scope target. Although the scope is defined in the modification rules it applies to each object individually. Rules with scope source only apply to the source objects, while rules with scope target only apply to the target objects. A rule with scope source will not be used in target objects, e.g. in a target connection. Rules with scope source are applied after the data extract, rules with scope target are applied before the data is applied.

Type of object

At which point of the copy process a particular table or column is modified depends, in addition to the scope, on the object type the modification set is attached to. The order of objects used for modifications is defined as follows:

Connection
Application model
Application model version
Environment
Installed application
Task template

If one modification set is attached to a connection and another to the task template, the rules of the set which is attached to the connection are applied first.

Order of the sets on the object

If several modification sets are attached to an object, they are processed from top to bottom. The list order can be defined via the ordering mode within the respective object.

Order of rules with the same scope

Within a modification set, the default execution order is from top to bottom. The default execution order of rules of the same scope can be changed via the ordering mode in the modification set. If two modification rules modify two different columns of a table, the ordering mode could be used to define which column is modified first.

Example

In this example the following is assumed:

An RLP task is used.
Source and target connection is the same connection.
The application model is the same for source and target.

Two modification sets are attached to the connection. Furthermore two modification sets are attached to the application model and three to the task template.

So the hierarchy looks like this:

Connection
- Modification Set 1
  - Modification Rule 1 (Scope Target)
  - Modification Rule 2 (Scope Source)
  - Modification Rule 3 (Scope Target)
- Modification Set 2
  - Modification Rule 1 (Scope Target)
  - Modification Rule 2 (Scope Target)
  - Modification Rule 3 (Scope Source)
Application Model
- Modification Set 1
  - Modification Rule 1 (Scope Source)
  - Modification Rule 2 (Scope Target)
  - Modification Rule 2 (Scope Source)
  - Modification Rule 4 (Scope Target)
- Modification Set 2
  - Modification Rule 1 (Scope Target)
  - Modification Rule 2 (Scope Source)
Task Template
- Modification Set 1
  - Modification Rule 1 (Scope Source)
- Modification Set 2
  - Modification Rule 1 (Scope Target)
  - Modification Rule 2 (Scope Source)
- Modification Set 3
  - Modification Rule 1 (Scope Source)

Ordering of modification in this example

After the data extract in stage 2 the following modification is applied:

Connection
- Modification Set 1
  - Modification Rule 2 (Scope Source)
- Modification Set 2
  - Modification Rule 3 (Scope Source)
Application Model
- Modification Set 1
  - Modification Rule 1 (Scope Source)
  - Modification Rule 2 (Scope Source)
- Modification Set 2
  - Modification Rule 2 (Scope Source)
Task Template
- Modification Set 1
  - Modification Rule 1 (Scope Source)
- Modification Set 2
  - Modification Rule 2 (Scope Source)
- Modification Set 3
  - Modification Rule 1 (Scope Source)

Before the data apply in stage 6 the following modification is applied:

Connection
- Modification Set 1
  - Modification Rule 1 (Scope Target)
  - Modification Rule 3 (Scope Target)
- Modification Set 2
  - Modification Rule 1 (Scope Target)
  - Modification Rule 2 (Scope Target)
Application Model
- Modification Set 1
  - Modification Rule 2 (Scope Target)
  - Modification Rule 4 (Scope Target)
- Modification Set 2
  - Modification Rule 1 (Scope Target)
Task Template
- Modification Set 2
  - Modification Rule 1 (Scope Target)