Data Extractor

Description

The data extractor extracts the data records based on the object containers' selection, exclusion, and reduction definitions, using either the database's unload tools or SQL queries, and stores the results in files.

The type of file depends on the overall settings. It can be the database utility output file of an unload, a format that can be used for the load, or the XDM internal serialization format.

The files containing the resulting data of the extract are stored in the data subdirectory of the current working directory root path (root/data). This can be changed by defining the environment property DataExtractorOutput.
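For example, assuming that environment properties are passed to the extractor as Java system properties (the exact mechanism may differ in your installation), the output directory could be redirected like this; the path is a placeholder:

-DDataExtractorOutput=/data/extracts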

The data extractor for the generic data extract (used when reduction, modification, or SQL data transport is involved) consists of multiple phases (a conceptual sketch follows the list):

  • Fetch: Extraction of the data from the database

  • Parse: Parsing the data into an internal representation

  • Modify: Modification of the data if defined

  • Write: Writing the data to the target files as explained above
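The following is a minimal conceptual sketch in Java of how a single row passes through these phases. The class and method names are illustrative only and do not reflect the actual XDM implementation; only standard JDBC calls are used.

// Conceptual sketch of the four phases for one table (illustrative names only).
import java.sql.*;
import java.util.*;

public class GenericExtractSketch {

    // Fetch: read the raw rows from the database.
    static ResultSet fetch(Connection con, String table) throws SQLException {
        return con.createStatement().executeQuery("SELECT * FROM " + table);
    }

    // Parse: convert the current raw row into an internal representation.
    static List<Object> parse(ResultSet rs, int columnCount) throws SQLException {
        List<Object> row = new ArrayList<>(columnCount);
        for (int i = 1; i <= columnCount; i++) {
            row.add(rs.getObject(i));
        }
        return row;
    }

    // Modify: apply modification rules, if any are defined (placeholder).
    static List<Object> modify(List<Object> row) {
        return row;
    }

    // Write: append the row to the target file (placeholder).
    static void write(List<Object> row, java.io.PrintWriter out) {
        out.println(row);
    }
}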

For generic data extraction, a statistics line is written at the end of the data extractor's log showing how long each phase took, both in absolute terms and in relation to the total duration. The same information is also logged in a separate report for the corresponding task stage.

An example of a possible statistics report is shown below:

Fetch (s)  Fetch %  Parse (s)  Parse %  Modify (s)  Modify %  Convert & Write (s)  Convert & Write %  Total (s)
53.440     63.3     1.970      2.3      0.250       0.3       28.790               34.1               84.50
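The percentage columns give each phase's share of the total elapsed time; in the example above, the fetch phase took 53.440 s of roughly 84.5 s in total, which corresponds to about 63 %.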

Please note that these phases do not exist for a data extraction that uses the database's unload utility. In that case, the only indication of the duration of the extract is the total runtime of the program and the memory used. Both statistics are logged at the end of the extract log, for example:

[extract] 1/1/20, 3:00:00 PM INFO   Used memory: 48/1,000 MB
[extract] 1/1/20, 3:00:00 PM INFO   Used elapsed time: 11.931 sec

Internally, the data extractor works with a number of threads; the default is 4. For every thread, one internal pipeline is prepared. Each of these pipelines consists of three workers: Fetch and Parse, Modify, and Writer. The workers themselves run in their own threads and pass the data rows from step to step so that the steps work in parallel. The tables to be processed are sorted by size in descending order and passed to the pipelines. When a pipeline has finished a table, it takes the next table from the list that has not been processed yet. To determine the sizes, the catalog tables or DB statistics must be available and are queried. This way, the number of rows is known before processing begins and the sorting can be done.
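The following sketch illustrates this scheduling scheme under simplified assumptions; the table names, row counts, and the logging are made up, and a real pipeline would run its Fetch and Parse, Modify, and Writer workers instead of just printing the assignment.

import java.util.*;
import java.util.concurrent.*;

public class PipelineSchedulingSketch {

    record Table(String name, long rowCount) {}

    public static void main(String[] args) throws InterruptedException {
        int threadCount = 4; // default number of parallel pipelines

        // Row counts would come from the catalog tables or DB statistics.
        List<Table> tables = new ArrayList<>(List.of(
                new Table("ORDERS", 5_000_000),
                new Table("CUSTOMERS", 800_000),
                new Table("COUNTRIES", 250)));

        // Sort by size, largest first, so that big tables start as early as possible.
        tables.sort(Comparator.comparingLong(Table::rowCount).reversed());
        Queue<Table> queue = new ConcurrentLinkedQueue<>(tables);

        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < threadCount; i++) {
            pool.submit(() -> {
                Table t;
                while ((t = queue.poll()) != null) {
                    // A real pipeline would process this table through its workers.
                    System.out.println(Thread.currentThread().getName()
                            + " processes " + t.name());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}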

Diagram

For optimizing the performance of the data extractor, the following parameters are relevant and can be defined on the connection level:

threadCount

The number of parallel threads being used. For each thread, a connection to the database is established. The ideal number depends on the CPU and I/O capacity of the database server. Note that the ideal number of threads may vary if the database server is executing additional workload alongside the XDM task.

fetchSize

The number of rows that are fetched from the database in one network operation when using the generic extract unit. This unit is used for modification, reduction, and in cases of SQL data transport in compatibility table copy tasks. The larger the fetch size, the faster the read operation, but the more memory is required.

You are encouraged to try different degrees of parallelism until you find the combination of thread count and fetch size that yields the highest throughput.
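At the JDBC level, the fetchSize parameter presumably maps to the fetch size hint on the statement, as in the following sketch; the URL, credentials, and table name are placeholders.

import java.sql.*;

public class FetchSizeExample {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://host:50000/SAMPLE", "user", "password");
             Statement stmt = con.createStatement()) {
            // Ask the driver to transfer 10,000 rows per network round trip:
            // fewer round trips, but more memory on the client side.
            stmt.setFetchSize(10_000);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM MYSCHEMA.MYTABLE")) {
                while (rs.next()) {
                    // process the row
                }
            }
        }
    }
}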

Names

Short  extract
Full   DataExtractor

Input parameters

Parameter                 Description
sourceInputFile           Source ObjectContainer
targetInputFile           Target ObjectContainer
url                       JDBC URL
threadCount               Number of threads being used
errorFile                 File path for errors
user                      DB user
databaseType              DB type
password                  DB password
sconnectionType           SFTP or FTP for source connection
tconnectionType           SFTP or FTP for target connection
sosuser                   Source user
tosuser                   Target user
sospassword               Password for FTP source
tospassword               Password for FTP target
sstreamport               Optional different source port
tstreamport               Optional different target port
shost                     Source host
thost                     Target host
useHooks                  Use table name mapping (true or false)
useEncryptedPassword      Encrypted password (true or false)
targetKeyFile             SSL key file
targetKeyFormat           SSL format
reductionRuleFile         File path of the file with reduction rules
modificationRuleListFile  File path to the JSON list of modification rules

Output parameters

Parameter    Description
return code  Return code