Oracle - Automatic Calculation of Unload Split
Auto-calculation decouples extraction from writing into files, making it possible to choose an optimal degree of parallelism for unloading while still producing smaller files.
Before the release of the auto-calculation feature, with the previous unload_split-in-data-set approach, increasing the number of unload splits beyond a certain point was counterproductive: the unloaded files were larger and masking was time-consuming.
By reducing the number of connections used for reading, you can run more jobs in parallel, or unload more tables in parallel within the same job. This also helps build the masking pipeline and parallelize the masking of files in a distributed manner.
With the auto-calculation feature, there is no longer any need to manually provide the number of splits. This saves time and makes the product more user-friendly and scalable.
Parameter Description
The following properties are available at the unload service level so that they can be set at deployment time. You can also override these values at the job level.
| Parameter | Description |
|---|---|
| `max_records_per_split` | The maximum number of records unloaded into a single extracted file. The default of 100,000,000 means each extracted file contains 100,000,000 records or fewer. |
| `max_split_per_connection` | The maximum number of splits each database connection can produce. For example, a value of 4 means one connection can result in up to 4 extracted files. |
| `max_parallel_connections_per_table` | The maximum number of parallel connections allowed for a table during the unload process. For example, a value of 4 means four parallel extraction connections can be executed against the table, each theoretically extracting 25% of the table's data. |
| `auto_calculate_unload_split` (default: `false`) | If `false`, the unload service unloads data using the legacy approach (`unload_split` in the data set). If `true`, it uses the Automatic Calculation of Unload Split approach, based on `max_records_per_split`, `max_parallel_connections_per_table`, and `max_split_per_connection`. |
Process Description
Depending on the size of the table (the number of rows fetched from ALL_TABLES), the number of parallel unload connections is computed from the parameters `max_split_per_connection` and `max_parallel_connections_per_table`. Each unload connection executes a SELECT SQL statement based on a modulo of the rowid and writes every 'n' records into a new file, where 'n' is defined by the parameter `max_records_per_split` (splitSize).
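The calculation described above can be sketched as follows. This is illustrative only, not the service's actual code; the function name `plan_unload` is invented here, and the default values are taken from the examples below.

```python
import math

# Defaults assumed from the examples in this document.
MAX_RECORDS_PER_SPLIT = 100_000_000
MAX_SPLIT_PER_CONNECTION = 4
MAX_PARALLEL_CONNECTIONS_PER_TABLE = 10

def plan_unload(table_rows: int) -> tuple[int, int]:
    """Return (parallel_connections, total_splits) for a given table row count."""
    # Number of files needed so that no file exceeds max_records_per_split rows.
    splits = max(1, math.ceil(table_rows / MAX_RECORDS_PER_SPLIT))
    # Each connection may produce at most max_split_per_connection files...
    connections = math.ceil(splits / MAX_SPLIT_PER_CONNECTION)
    # ...and the per-table connection cap is applied last.
    return min(connections, MAX_PARALLEL_CONNECTIONS_PER_TABLE), splits

# 5 billion rows: 50 splits, 13 connections capped to 10.
print(plan_unload(5_000_000_000))  # → (10, 50)
```

Running the same function on the table sizes from the three examples below reproduces the documented split and connection counts.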
In the following three examples, we consider the default values of all the parameters.
Example 1: Table size 100,000,000 (100 million)
Automatic Calculation of Unload Split approach compared with the legacy (`unload_split` in data set) approach: effectively no change. Number of splits needed = 100,000,000 / 100,000,000 = 1, so the table is extracted with a single unload connection into a single file.
Example 2: Table size 200,000,000 (200 million)
Automatic Calculation of Unload Split approach:
max_records_per_split: 100,000,000. Number of splits needed = 200,000,000 / 100,000,000 = 2.
max_split_per_connection: 4. Connections needed = 2 / 4 = 0.5; use the next positive integer = 1.
The table is extracted with a single unload connection (1 SELECT SQL query), split into two files.
Example 3: Table size 5,000,000,000 (5 billion)
Automatic Calculation of Unload Split approach:
max_records_per_split: 100,000,000. Number of splits needed = 5,000,000,000 / 100,000,000 = 50.
max_split_per_connection: 4. Connections needed = 50 / 4 = 12.5; use the next positive integer = 13.
However, max_parallel_connections_per_table is 10, so only 10 connections will be opened for the table.
The table is extracted with 10 unload connections in parallel (10 SELECT SQL queries), with each connection extracting approximately 500 million rows and writing them into 5 files (a new file is created every 100 million records).
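Each parallel connection must read a disjoint slice of the table. The document states only that the SELECT is based on a modulo of the rowid and does not show the actual predicate, so the `MOD(ORA_HASH(rowid), n)` filter below is an assumption for illustration, not the service's exact SQL:

```python
def unload_query(table: str, n_connections: int, slice_index: int) -> str:
    # Hypothetical slicing predicate: hash the rowid and keep rows whose
    # hash modulo the connection count equals this connection's index.
    return (
        f"SELECT * FROM {table} "
        f"WHERE MOD(ORA_HASH(rowid), {n_connections}) = {slice_index}"
    )

# One such query per connection; for Example 3 there would be 10 of them.
for i in range(2):
    print(unload_query("EMPLOYEES", 10, i))
```

Because the predicate partitions rows by a hash of the rowid, each of the n queries returns roughly 1/n of the table, matching the "each process theoretically extracting 25% of table data" description for four connections.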