Subset Apply

Description

Subset Apply is one of the main batch programs for XDM’s row level processing as it applies (or deletes) the data that has to be extracted previously. The execution of subset apply is based on object container files for source and target and the data to be applied into target. The Subset Apply uses apply units to insert, modify or delete data in the target system. Every ApplyUnit works on a single table. First the Subset Apply creates the apply units and gives them an order to be applied. Then the apply units can be executed in parallel, by setting the tread count higher than 1. This is controlled by the applyExecutor. Every apply unit does have a database executor which is responsible for the execution of the apply unit. By activating the batch mode and setting the batch size, the database executor can be executed in batch tranches. While the apply units are executed, the subset apply writes the applied keys into a cache file. This cache file is necessary for the resuming functionality.

The fill mode decides how the data will be applied into the target. It must be one of APPEND, DELETE, INSERT, MERGE, REPLACE, TABLE_REPLACE, UPDATE, NONE.

This will lead to different ApplyUnits, that are used internally by the Subset Apply. The used units used per table are listed in the log at the beginning of the execution.

General performance considerations

For optimizing the performance of the subset apply, the following parameters are relevant and can be defined on the connection level:

threadCount: The number of parallel threads being used. For each thread, a connection to the database is established. The ideal number depends on the CPU and I/O capacity of the database server. Note that the ideal number of threads may vary if the database server is executing additional workload alongside the XDM task.
jdbcBatchActivated: This property controls whether JDBC batch mode is supported for the given database system. If JDBC batch mode is activated, multiple rows will be bundled and sent as a batch to the database system when applying the data. Using batch mode can improve performance, but also requires more memory on both the execution server and the database server.
jdbcBatchSize: This property controls how many rows are packed into one unit when the subset apply batch mode is active. Valid values are all integer values greater than one. The maximum values is limited by the capacity of your execution server and the database server. Typically, this maximum is around 5000 rows per batch statement. It is only used when subset apply batch mode is active.
resumeSubsetApply: Allows a task to be resumed in the final stage, where the data is written to the target. With this option enabled, a task which fails or is canceled can be resumed at the last written row. The option ensures that rows will not be applied twice. When this option is disabled, then a canceled or aborted task will be resumed from the beginning. Whether rows are applied twice in the target then depends on the Fill mode setting. This property increases the runtime and storage usage. To improve runtime and storage usage, this option should be disabled.
jdbcProfiling: This option increases the memory consumption and decreases the performance of the subset apply. It will give indications about the performance of the database operations. In the resulting files, the execution times per table and statement can be found. This allows to identify performance bottlenecks in the database operations per statement.

Specific PostgreSQL performance considerations

As the implementation of the PostgreSQL JDBC driver does not support batch mode, special ApplyUnits are used for PostgreSQL.

These units are only present for the fill modes INSERT and TABLE_REPLACE.

Per default, every insert operation is attempted to be executed via an inline COPY utility call. The performance of the COPY utility is about 10 times better than the performance of a single insert statement.

If the inline COPY utility call fails, the Subset Apply will fall back to a multi values insert statement for the affected table.

If the multi value insert detects a discard situation, the block of rows to be inserted will be performed with the default insert statements.

Furthermore, the COPY utility call and the multi value insert statement cannot be used for tables with generated always columns. In these cases, the Subset Apply will directly use the default insert statements for these tables.

This strategy leads to a very high performance for the insert operation, as the COPY utility is used as long as possible. If there are any issues with the COPY utility, the performance will be reduced to a multi value insert statement, which is still better than a single insert statement.

Generated Always columns and foreign keys

If foreign keys are present in the target system, it is possible that some tables contain columns that are defined to be generated always. In all cases, new generated always values will be created by the database system itself.

The behavior of the Subset Apply for each table depends on the fill mode. Regardless of the fill mode, the generated values of generated always columns are propagated to the dependent tables via the foreign keys.

Names

Short	subapply
Full	SubsetApply

Input parameters

Parameter	Description
sourceObjectContainerFile	The object container file of the source system.
inputObjectContainerFile	The object container file of the target system, into which the data should be applied.
inputDirectory	Path to the directory containing the data to be applied.
applyDir	Path to a directory into which the subset apply writes the applied keys as a cache. Necessary for resuming functionality.
useSimpleDirectories	Boolean if appending container name to inputDirectory should be skipped. Default: false
fillMode	Method how the data should be applied to the target. Default: INSERT
threadCount	How many threads should be used. Default: 1
classificationTermKeyFile	Path to a file containing information about classification term keys.
columnExcludeRuleFile	Path to a file containing information about column exclude rules.
columnSpecificationRuleFile	Path to a file containing information about column specification rules.
foreignKeyDeactivation	Boolean if foreign keys should be deactivated for applying data. One of NONE (default), DISABLE and DROP.

Parameter

Description

sourceObjectContainerFile

The object container file of the source system.

inputObjectContainerFile

The object container file of the target system, into which the data should be applied.

inputDirectory

Path to the directory containing the data to be applied.

applyDir

Path to a directory into which the subset apply writes the applied keys as a cache. Necessary for resuming functionality.

useSimpleDirectories

Boolean if appending container name to inputDirectory should be skipped. Default: false

fillMode

Method how the data should be applied to the target. Default: INSERT

threadCount

How many threads should be used. Default: 1

classificationTermKeyFile

Path to a file containing information about classification term keys.

columnExcludeRuleFile

Path to a file containing information about column exclude rules.

columnSpecificationRuleFile

Path to a file containing information about column specification rules.

foreignKeyDeactivation

Boolean if foreign keys should be deactivated for applying data. One of NONE (default), DISABLE and DROP.

Output parameters

Parameter	Description
return code	The return code of the program.

Parameter

Description

return code

The return code of the program.

Output information

Subset apply creates reports containing statistics for applied, deleted and discarded rows.