Oracle - Automatic Calculation of Unload Split
Auto-calculation decouples extraction from writing into files, making it possible to choose an optimal degree of parallelism for unloading while still producing smaller files.
Before the release of the auto-calculation feature, with the previous unload_split-in-data-set approach, increasing the number of unload splits beyond a certain point was counterproductive: the unloaded files were larger and masking was time-consuming.
By reducing the number of connections used for reading, you can run more jobs in parallel, or unload more tables in parallel within the same job. This also helps build the masking pipeline and parallelize the masking of files in a distributed manner.
With the auto-calculation feature, there is no longer any need to manually provide the number of splits. This saves time and makes the product more user-friendly and scalable.
Parameter Description
The following properties are available at the unload service level so that they can be set at deployment time. You can also override these values at the job level.
| Parameter | Description |
|---|---|
| `max_records_per_split` | The maximum number of records unloaded into a single extracted file. The default of 100,000,000 means each extracted file contains 100,000,000 records or fewer. |
| `max_split_per_connection` | The maximum number of splits each database connection can produce. For example, a value of 4 means one connection can result in up to 4 extracted files. |
| `max_parallel_connections_per_table` | The maximum number of parallel connections allowed for a table during the unload process. For example, a value of 4 means four parallel extraction connections can be executed against the table, each theoretically extracting 25% of the table's data. |
| `auto_calculate_unload_split` (default: `false`) | If `false`, the unload service unloads data using the legacy approach (`unload_split` in the data set). If `true`, it uses the Automatic Calculation of Unload Split approach, based on `max_records_per_split`, `max_parallel_connections_per_table`, and `max_split_per_connection`. |
Process Description
Depending on the size of the table (the number of rows fetched from ALL_TABLES), the number of parallel unload connections is computed from the parameters `max_split_per_connection` and `max_parallel_connections_per_table`. Each unload connection executes a SELECT SQL statement based on a modulo of the rowid and writes every 'n' records into a new file, where 'n' is defined by the parameter `max_records_per_split` (splitSize).
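The calculation described above can be sketched as follows. This is illustrative only, not the service's actual code; the function name `plan_unload` is invented here, and the default values are taken from the examples below.

```python
import math

# Defaults assumed from the examples in this document.
MAX_RECORDS_PER_SPLIT = 100_000_000
MAX_SPLIT_PER_CONNECTION = 4
MAX_PARALLEL_CONNECTIONS_PER_TABLE = 10

def plan_unload(table_rows: int) -> tuple[int, int]:
    """Return (parallel_connections, total_splits) for a given table row count."""
    # Number of files needed so that no file exceeds max_records_per_split rows.
    splits = max(1, math.ceil(table_rows / MAX_RECORDS_PER_SPLIT))
    # Each connection may produce at most max_split_per_connection files...
    connections = math.ceil(splits / MAX_SPLIT_PER_CONNECTION)
    # ...and the per-table connection cap is applied last.
    return min(connections, MAX_PARALLEL_CONNECTIONS_PER_TABLE), splits

# 5 billion rows: 50 splits, 13 connections capped to 10.
print(plan_unload(5_000_000_000))  # → (10, 50)
```

Running the same function on the table sizes from the three examples below reproduces the documented split and connection counts.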
In the following three examples, we consider the default values of all the parameters.
Example 1: Table size 100,000,000 (100 million)
Automatic Calculation of Unload Split approach compared with the legacy (`unload_split` in data set) approach: effectively no change. Number of splits needed = 100,000,000 / 100,000,000 = 1, so the table is extracted with a single unload connection into a single file.
Example 2: Table size 200,000,000 (200 million)
Automatic Calculation of Unload Split approach:
max_records_per_split: 100,000,000. Number of splits needed = 200,000,000 / 100,000,000 = 2.
max_split_per_connection: 4. Connections needed = 2 / 4 = 0.5; use the next positive integer = 1.
The table is extracted with a single unload connection (1 SELECT SQL query), split into two files.
Example 3: Table size 5,000,000,000 (5 billion)
Automatic Calculation of Unload Split approach:
max_records_per_split: 100,000,000. Number of splits needed = 5,000,000,000 / 100,000,000 = 50.
max_split_per_connection: 4. Connections needed = 50 / 4 = 12.5; use the next positive integer = 13.
However, max_parallel_connections_per_table is 10, so only 10 connections will be opened for the table.
The table is extracted with 10 unload connections in parallel (10 SELECT SQL queries), with each connection extracting approximately 500 million rows and writing them into 5 files (a new file is created every 100 million records).
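Each parallel connection must read a disjoint slice of the table. The document states only that the SELECT is based on a modulo of the rowid and does not show the actual predicate, so the `MOD(ORA_HASH(rowid), n)` filter below is an assumption for illustration, not the service's exact SQL:

```python
def unload_query(table: str, n_connections: int, slice_index: int) -> str:
    # Hypothetical slicing predicate: hash the rowid and keep rows whose
    # hash modulo the connection count equals this connection's index.
    return (
        f"SELECT * FROM {table} "
        f"WHERE MOD(ORA_HASH(rowid), {n_connections}) = {slice_index}"
    )

# One such query per connection; for Example 3 there would be 10 of them.
for i in range(2):
    print(unload_query("EMPLOYEES", 10, i))
```

Because the predicate partitions rows by a hash of the rowid, each of the n queries returns roughly 1/n of the table, matching the "each process theoretically extracting 25% of table data" description for four connections.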