Data Extractor
Description
The data extractor extracts data records according to the object containers' selection, exclusion, and reduction definitions, using the database's unload tools or SQL queries, and stores the results in files.
The file type depends on the overall settings: it can be the database utility's unload output file, a format that can be used for the load, or the XDM-internal serialization format.
The files containing the extracted data are stored in the data subdirectory of the current working directory root path (root/data).
This can be changed by defining the environment property DataExtractorOutput.
The data extractor for the generic data extract (used when reduction, modification, or SQL data transport is involved) runs in multiple phases:
- Fetch: extraction of the data from the database
- Parse: parsing the data into an internal representation
- Modify: modification of the data, if defined
- Write: writing the data to the target files as explained above
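Conceptually, the phases above form a linear pipeline that each batch of rows passes through in order. The following sketch illustrates that flow with hypothetical stand-ins; XDM's real internal classes are not public, and all names here are assumptions:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ExtractPhases {
    public static void main(String[] args) {
        // Hypothetical stand-ins for the four phases; XDM's real classes are internal.
        List<byte[]> raw = fetch();                 // Fetch: read raw rows from the database
        List<String[]> rows = raw.stream()
                .map(ExtractPhases::parse)          // Parse: build the internal representation
                .map(ExtractPhases::modify)         // Modify: apply modification rules, if any
                .collect(Collectors.toList());
        write(rows);                                // Write: serialize to the target file
    }

    static List<byte[]> fetch() { return List.of("a,b".getBytes()); }
    static String[] parse(byte[] raw) { return new String(raw).split(","); }
    static String[] modify(String[] row) { return row; } // identity when no rules are defined
    static void write(List<String[]> rows) {
        rows.forEach(r -> System.out.println(String.join("|", r)));
    }
}
```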
For generic data extraction, a statistics line is written at the end of the data extractor's log showing how long each phase took, both in seconds and relative to the total duration. The same statistics are also logged in a separate report for the corresponding task stage.
An example statistics report is shown below:
| Fetch (s) | Fetch % | Parse (s) | Parse % | Modify (s) | Modify % | Convert & Write (s) | Convert & Write % | Total (s) |
|---|---|---|---|---|---|---|---|---|
| 53.440 | 63.3 | 1.970 | 2.3 | 0.250 | 0.3 | 28.790 | 34.1 | 84.50 |
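As a plausibility check, the percentage columns can be recomputed from the phase durations. A minimal sketch using the example values above (note that the phase durations sum to 84.45 s, slightly below the reported total of 84.50 s, presumably due to overhead outside the phases):

```java
public class PhaseStats {
    public static void main(String[] args) {
        // Example phase durations in seconds, taken from the statistics row above
        double fetch = 53.440, parse = 1.970, modify = 0.250, write = 28.790;
        double total = fetch + parse + modify + write; // 84.45 s

        System.out.printf("Fetch %.1f%%, Parse %.1f%%, Modify %.1f%%, Write %.1f%%%n",
                100 * fetch / total, 100 * parse / total,
                100 * modify / total, 100 * write / total);
        // Fetch 63.3%, Parse 2.3%, Modify 0.3%, Write 34.1%
    }
}
```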
Please note that these phases do not exist for a data extraction using the database's unload utility. In that case, the only indications of the extract's duration are the program's total runtime and the memory used. Both statistics are logged at the end of the extract log, for example:
[extract] 1/1/20, 3:00:00 PM INFO Used memory: 48/1,000 MB
[extract] 1/1/20, 3:00:00 PM INFO Used elapsed time: 11.931 sec
Internally, the data extractor works with a number of threads; the default is 4. For each thread, one internal pipeline is prepared. Each pipeline consists of three workers: Fetch and Parse, Modify, and Writer. The workers themselves run in their own threads and pass the data rows along step by step so that they work in parallel. The tables to be processed are sorted by size in descending order and passed to the pipelines. When a pipeline finishes its table, it takes the next table from the list that has not been processed yet. To determine the table sizes, the catalog tables or database statistics must be available and are queried; this way, the number of rows is known before processing begins and the sorting can be done.
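The scheduling scheme described above — largest tables first, each idle pipeline pulling the next unprocessed table — can be sketched with a fixed thread pool and a shared work queue. This is an illustrative sketch, not XDM's actual implementation; the table names and row counts are invented:

```java
import java.util.*;
import java.util.concurrent.*;

public class ExtractScheduler {
    // Hypothetical table descriptor; the row counts would come from
    // the catalog tables or database statistics.
    record Table(String name, long rowCount) {}

    public static void main(String[] args) throws InterruptedException {
        int threadCount = 4; // default number of pipelines

        List<Table> tables = new ArrayList<>(List.of(
                new Table("ORDERS", 5_000_000),
                new Table("ITEMS", 12_000_000),
                new Table("CUSTOMERS", 800_000)));

        // Largest tables first, so the longest-running extracts start early.
        tables.sort(Comparator.comparingLong(Table::rowCount).reversed());
        Queue<Table> work = new ConcurrentLinkedQueue<>(tables);

        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < threadCount; i++) {
            pool.submit(() -> {
                Table t;
                // A pipeline that finishes its table takes the next unprocessed one.
                while ((t = work.poll()) != null) {
                    System.out.println("extracting " + t.name());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```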
To optimize the performance of the data extractor, the following parameters are relevant; they can be defined at the connection level:
threadCount
The number of parallel threads being used. For each thread, a connection to the database is established. The ideal number depends on the CPU and I/O capacity of the database server. Note that the ideal number of threads may vary if the database server is executing additional workload alongside the XDM task.
fetchSize
The number of rows that are fetched in one network operation from the database when using the generic extract unit. This unit is used for modification, reduction and in cases of SQL data transport in compatibility table copy Tasks. The larger the fetch size, the faster the read operation, but the more memory is required.
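For the generic extract unit, this setting corresponds to the standard JDBC fetch-size hint, which any driver-backed reader can apply via `Statement.setFetchSize`. A minimal sketch — the URL, credentials, and table name below are placeholders, not values from XDM:

```java
import java.sql.*;

public class FetchSizeDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; replace with your environment's values.
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://host:50000/SAMPLE", "user", "password");
             Statement stmt = con.createStatement()) {
            // Hint to the driver: transfer 10,000 rows per network round trip.
            // Larger values mean fewer round trips but more client memory,
            // e.g. 10,000 rows x ~200 bytes/row is roughly 2 MB buffered at once.
            stmt.setFetchSize(10_000);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM MYTABLE")) {
                while (rs.next()) {
                    // process the row
                }
            }
        }
    }
}
```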
You are encouraged to try different degrees of parallelism until you find the number of threads in combination with the fetch size that yields the highest throughput.
Input parameters
| Parameter | Description |
|---|---|
| sourceInputFile | Source ObjectContainer |
| targetInputFile | Target ObjectContainer |
| url | JDBC URL |
| threadCount | Number of threads being used |
| errorFile | File path for errors |
| user | DB user |
| databaseType | DB type |
| password | DB password |
| sconnectionType | SFTP or FTP for source connection |
| tconnectionType | SFTP or FTP for target connection |
| sosuser | User source |
| tosuser | User target |
| sospassword | Password for FTP source |
| tospassword | Password for FTP target |
| sstreamport | Optional different source port |
| tstreamport | Optional different target port |
| shost | Host source |
| thost | Host target |
| useHooks | Use table name mapping ( |
| useEncryptedPassword | Encrypted password ( |
| targetKeyFile | SSL key file |
| targetKeyFormat | SSL format |
| reductionRuleFile | File path of the file with reduction rules |
| modificationRuleListFile | File path to a JSON list of modification rules |