Subset Extract

Description

The subset extractor is the main batch program for XDM's row-level processing: it extracts the relevant data and stores it in files. The extraction is based on an input object container, a start table, one or more start values, and defined data relation rules and reduction rules. When used in an XDM row level processor task, it also applies modification rules for scope SOURCE from a JSON file located under etc/modificationRules.json. The data relation rules, which define the relations between tables, are read from another JSON file that is likewise located under etc/dataRelationRules.json.
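Conceptually, each data relation rule links a key column in one table to the matching column in a dependent table. The following Java sketch is a hypothetical, simplified model for illustration only; it does not reflect the actual structure of dataRelationRules.json:

public class RelationRuleExample {
    // Hypothetical, simplified model of a data relation rule; the actual
    // structure of dataRelationRules.json may differ.
    record RelationRule(String parentTable, String parentColumn,
                        String childTable, String childColumn) { }

    public static void main(String[] args) {
        // Example: rows in production.dept_emp depend on production.employees
        // via the emp_no column (table names taken from the log sample below).
        RelationRule rule = new RelationRule(
                "production.employees", "emp_no",
                "production.dept_emp", "emp_no");
        System.out.println(rule);
    }
}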

The internal program logic is as follows: the subset extractor has a processing controller, which controls how many threads are running at the same time. These threads are key processors, and their number can be set by the thread count. The number of configured threads is shown in the log output:

[subextract] 1/1/20, 3:00:00 PM INFO   ProcessingController created (running 1 thread(s), queue size: 50)

The key processors are responsible for extracting the data for the given keys. They also determine dependent keys in other tables that are connected by relations, so the subset extractor can dynamically collect all keys needed to fetch the right data. After collecting the keys, the subset extractor starts fetching the data. The number of rows that are fetched in one selection operation from the database is defined by the extractPackageSize. The number of packages that can be cached by the subset extractor before the process is slowed down to empty the cache is defined by the extractQueueSize; the queue size is also part of the processing controller output.
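The interplay of extractPackageSize and extractQueueSize follows the classic bounded producer/consumer pattern. The following Java sketch illustrates the backpressure idea with hypothetical names; it is not XDM's actual implementation:

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    public static void main(String[] args) throws InterruptedException {
        int extractQueueSize = 50; // packages cached before producers block
        BlockingQueue<List<String>> packageQueue =
                new ArrayBlockingQueue<>(extractQueueSize);

        // Producer side: put() blocks when the queue is full -- the
        // "slowed down to empty the cache" behavior described above.
        packageQueue.put(List.of("row1", "row2"));

        // Consumer side: take() removes a package, e.g. for writing it out.
        List<String> pkg = packageQueue.take();
        System.out.println("wrote package of " + pkg.size() + " rows");
    }
}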

The following graphic shows an abstracted view of the subset extractor’s internal processing:

Diagram

The number of KeyProcessors is defined by the thread count. The KeyProcessors are responsible for extracting the data for the given keys. Whenever a KeyProcessor finds references to other tables, it requests a LoadProcessor; the LoadProcessors use the extractPackageSize to fetch the data for the dependent keys in packages. Once a package is fetched, a new KeyProcessor is started to process it. This is coordinated by the ProcessingController.
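Put together, the processing can be pictured as the following simplified Java sketch (hypothetical names and stubs, not XDM's API): pending keys are drained in packages of extractPackageSize, each package is fetched with a single query, and keys referenced by the fetched rows are queued for further processing until no keys remain:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ExtractLoopSketch {
    static final int EXTRACT_PACKAGE_SIZE = 1000; // illustrative value

    public static void main(String[] args) {
        // The start values seed the pending-key queue.
        Deque<String> pendingKeys = new ArrayDeque<>(List.of("10001", "10002"));

        // Fixed-point loop: the process ends once no new keys are found.
        while (!pendingKeys.isEmpty()) {
            // LoadProcessor role: drain up to one package of keys and
            // fetch the corresponding rows with a single query.
            List<String> pkg = new ArrayList<>();
            while (pkg.size() < EXTRACT_PACKAGE_SIZE && !pendingKeys.isEmpty()) {
                pkg.add(pendingKeys.poll());
            }
            List<String> rows = fetch(pkg);

            // KeyProcessor role: derive dependent keys from the fetched rows;
            // in reality, already-processed keys are filtered out here.
            pendingKeys.addAll(referencedKeys(rows));
        }
    }

    // Stubs standing in for the database fetch and the relation-rule lookup.
    static List<String> fetch(List<String> keys) { return List.of(); }
    static List<String> referencedKeys(List<String> rows) { return List.of(); }
}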

The selected data is stored in DAT files as a result of the subset extraction. Information on what was extracted can be found under files/extract_overview.txt in the task’s base directory.

This program should not be used in a hook, as it would be too difficult to supply all the information necessary to run it correctly.

For optimizing the performance of the subset extractor, the following parameters are relevant and can be defined on the connection level (an illustrative set of values is sketched after this list):

threadCount

The number of parallel threads being used. For each thread, a connection to the database is established. The ideal number depends on the CPU and I/O capacity of the database server. Note that the ideal number of threads may vary if the database server is executing additional workload alongside the XDM task.

extractPackageSize

The number of rows that are fetched in one selection operation from the database. A larger package size can enhance the performance of the extraction, but it also increases the memory consumption. It also influences the number of keys that are fetched with a single query. If the package size is too large, this can lead to queries which are too long for the source database system.

extractQueueSize

The number of packages that can be cached by the subset extractor before the process is slowed down to empty the cache. A bigger cache leads to better performance, but also uses more memory.

extractTraceActive

Activates tracing of the extraction process. This option increases the memory consumption and decreases the performance of the subset extractor; it should be used for development and debugging purposes only.

jdbcProfiling

Enables profiling of the JDBC operations. This option increases the memory consumption and decreases the performance of the subset extractor; like extractTraceActive, it should be used for development and debugging purposes only.
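As a rough starting point, a connection-level configuration might combine these parameters as follows. The notation and the values are illustrative only and must be tuned against the actual database workload:

threadCount=4
extractPackageSize=1000
extractQueueSize=50
extractTraceActive=false
jdbcProfiling=false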

For collecting XML data from a Db2 LUW database where the STRING_UNITS parameter on the database is CODEUNITS16 or CODEUNITS32 instead of OCTETS, it is necessary to set the custom parameter luwStringUnits on the source connection to CODEUNITS16 or CODEUNITS32 accordingly.
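For example, assuming a key=value notation for custom connection parameters (the exact way to set them depends on your XDM configuration):

luwStringUnits=CODEUNITS32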

Logging

The subset extractor logs information about the extraction process. Information about the throughput can be read from the log output: every 30 seconds, the throughput is calculated and logged. This interval can be configured via the task runtime property subsetextract.statisticwriter.sleeptime, given in milliseconds.
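For example, to calculate and log the statistics every 60 seconds instead (the value is given in milliseconds):

subsetextract.statisticwriter.sleeptime=60000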

A typical block looks like this:

[subextract] 12/11/24, 1:56:39 PM INFO    Extract process for last 30 sec.
[subextract] +----------------------+-----------+--------+----------+---------------+--------------+----------------+----------------+-----------+------------+------------+----------+
[subextract] |     Source Table     | Extracted |  New   | Rows/min | DB Fetch (ms) | Convert (ms) | Key Check (ms) | Traversal (ms) | Mod. (ms) | Write (ms) | Other (ms) | Sum (ms) |
[subextract] +----------------------+-----------+--------+----------+---------------+--------------+----------------+----------------+-----------+------------+------------+----------+
[subextract] | production.dept_emp  |    308139 | 164095 | 328190.0 |          3276 |          844 |           3396 |          40662 |        22 |        942 |      11283 |    60428 |
[subextract] | production.employees |    285333 | 148333 | 296666.0 |          3330 |          878 |           3194 |          34916 |        24 |       1224 |      16209 |    59777 |
[subextract] +----------------------+-----------+--------+----------+---------------+--------------+----------------+----------------+-----------+------------+------------+----------+
[subextract] 12/11/24, 1:56:39 PM INFO    Overall Rows processed:       593,516, heap usage  437/ 512 MB
[subextract] 	Pending keys to be processed(583,526 overall):
[subextract] 		production.departments ([dept_no]): 493
[subextract] 		production.dept_manager ([dept_no]): 9
[subextract] 		production.employees ([emp_no]): 13024
[subextract] 		production.salaries ([emp_no]): 285000
[subextract] 		production.titles ([emp_no]): 285000

The table shows all processing done on a particular table for the last logging interval. The columns are defined as follows:

Source Table

The table that was processed.

Extracted

The number of rows that were extracted over the complete runtime.

New

The number of rows that were new in the last logging interval.

Rows/min

The throughput per minute in the last logging interval.

DB Fetch

The time spent fetching data from the database.

Convert

The time spent converting the data.

Key Check

The time spent checking the keys.

Traversal

The time spent reading the traversal information from the fetched data. This is enabled if at least one data relation rule in the configuration uses a traversal count greater than 0. In that case, for each row returned by the database, the key that was used to get the data needs to be identified.

Mod.

The time spent applying the modification rules.

Write

The time spent writing the data to the output file.

Other

The time spent on other tasks not named above.

Sum

The sum of all times spent on the tasks.

In the example above, the sum of the processing times is bigger than the logging interval. This happens because of parallel processing: ideally, the sum of all processing times equals the interval multiplied by the number of worker threads. Below the table, the number of overall extracted rows for all tables since the beginning of the process is shown, together with the heap usage.
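For the sample above, this can be checked roughly using the Sum column:

60,428 ms + 59,777 ms = 120,205 ms ≈ 120 s
120 s / 30 s interval ≈ 4 worker threads busy during the interval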

The pending keys to be processed are shown in the next lines. These are keys that still need to be processed in SELECT queries to fetch the data. The list is a snapshot of the pending key queries at the time of logging; with each iteration, the numbers and the listed tables may change. It gives a rough overview of the pending keys and of the amount of data that is still to be processed. Whenever a new table in the graph is touched, additional tables referenced by data relation rules may come into view and contribute new pending keys. The process ends when there are no new keys to fetch.

Names

Short

subextract

Full

Extract

Input parameters


objectContainerPath

Path to the object container file.

startTable

The selection’s start table, using pattern [TABLE_SCHEMA].[TABLE_NAME].

startValuesFilePath

The path to a CSV file containing the start values (an illustrative file layout is sketched after this parameter list).

outputDir

The directory into which the extracted data reference is placed. The data itself will be stored in the configured blob store for the XDM installation. The reference file contains the internal path for the blob store.

tempDir

The file path for temporary data. The subset extractor temporarily stores its working information in a BerkeleyDB.

reductionRuleFile

Path to a JSON file containing reduction rules.
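A minimal start-values file might look like the following sketch, assuming a header line naming the key column and one value per line; the exact layout depends on the start table's key columns:

emp_no
10001
10002
10003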

Output parameters


return code

The return code of the program.

Output information

The subset extractor creates reports containing statistics for extraction and modification.

Besides these reports, the extracted data is stored in DAT files inside the output directory (defined via input parameter outputDir).