Data source support

Oracle connector

Oracle Database (commonly referred to as Oracle RDBMS or simply as Oracle) is a multi-model database management system produced and marketed by Oracle Corporation. The following table lists the versions that have been tested in the lab setup:

Platforms	Version
Linux	Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production - AWS
Linux	Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production - GCP

The user on the source database must have SELECT privileges.
The user on the target database must have all privileges and SELECT_CATALOG_ROLE.

Supported Data Types

The following are the different data types that are tested in our lab setup:

VARCHAR
VARCHAR2
NUMBER
FLOAT
DATE
TIMESTAMP (default)
CLOB
BLOB (with text)
User Defined Types:
- Collection (Nested table only)
Structured data types:
- XML
- JSON

Hyperscale Compliance does not support the following special characters in a user-defined type name: ~!@#$%^&*()\\\"?:;,/\\\\`+=[]{}|<>'-.\")]. Collections of CLOB and BLOB in a user-defined type are also not supported.
Hyperscale Compliance does not support the following special characters in a database column name: ~!@#$%^&*()\\\"?:;,/\\\\`+=[]{}|<>'-.\")]

Using multiple date formats for masking date/timestamp columns in Oracle data sources

Follow these steps to change the date format. The values shown are sample values.

Add an environment variable for the unload service in docker-compose.yaml.

CODE

unload-service:
    environment:
      - JDBC_DATE_TIMESTAMP_FORMAT=yyyy-MM-dd HH:mm:ss.SSS

Add an environment variable for the load service in docker-compose.yaml.

CODE

load-service:
    environment:
      - SQLLDR_DATE_TIMESTAMP_FORMAT=YYYY-MM-DD HH24:MI:SS.FF

Define the date format for dataset masking inventory.

CODE

"masking_inventory": [
        {
          "field_name": "COL_TIMESTAMP",
          "domain_name": "DOB",
          "algorithm_name": "DateShiftVariable",
          "date_format": "yyyy-MM-dd HH:mm:ss.SSS"
        }
      ]

Restart the containers to reflect the changes.
Repeat the same process if you want to use another date format.

For a single dataset, mask only tables that share the same date format.
The date format in the dataset masking inventory must match the unload format.
You can build the equivalent Oracle load format from the format models described at https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements004.htm#i34510.
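
As a rough illustration of how the Java-style pattern (unload service and masking inventory) lines up with the Oracle format (load service), the sketch below translates only the pattern letters that appear in the sample above. The function name `java_to_oracle_format` and the mapping table are illustrative assumptions, not part of Hyperscale Compliance; for other patterns, extend the table from the Oracle format model reference.

```python
import re

# Partial, illustrative mapping: only the pattern letters used in the sample above.
JAVA_TO_ORACLE = {
    "yyyy": "YYYY",  # four-digit year
    "MM": "MM",      # month
    "dd": "DD",      # day of month
    "HH": "HH24",    # hour of day, 0-23
    "mm": "MI",      # minute
    "ss": "SS",      # second
    "SSS": "FF",     # fractional seconds
}

# Longest tokens first so that "yyyy" and "SSS" are matched whole.
_TOKEN = re.compile("|".join(sorted(JAVA_TO_ORACLE, key=len, reverse=True)))


def java_to_oracle_format(pattern: str) -> str:
    """Translate a Java-style date pattern into an Oracle format model."""
    leftover = _TOKEN.sub("", pattern)
    if re.search(r"[A-Za-z]", leftover):
        # Refuse to guess at pattern letters that are not in the table.
        raise ValueError(f"unmapped pattern letters in {pattern!r}")
    return _TOKEN.sub(lambda match: JAVA_TO_ORACLE[match.group(0)], pattern)


print(java_to_oracle_format("yyyy-MM-dd HH:mm:ss.SSS"))
# prints: YYYY-MM-DD HH24:MI:SS.FF
```

The helper raises an error instead of passing unknown letters through, so an unsupported pattern is caught before the containers are restarted.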

Property values

Property	Value
`SKIP.LOAD.SPLIT.COUNT.VALIDATION`	`false`
`SKIP.UNLOAD.SPLIT.COUNT.VALIDATION`	`false`

For default values, see Configuration settings.

Known limitations

If the target table columns use the CHAR data type with BYTE length semantics to store multibyte characters, the masked data generated by an algorithm may exceed the column length, causing the job to fail. The workaround is to use an algorithm that generates shorter masked data.
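
The following sketch shows why a masked value with the same number of characters can still overflow a column declared with BYTE semantics. It assumes the database character set is UTF-8; the helper name `fits_byte_column` and the sample values are hypothetical.

```python
def fits_byte_column(value: str, max_bytes: int, encoding: str = "utf-8") -> bool:
    """Return True if the encoded value fits a column sized in bytes."""
    return len(value.encode(encoding)) <= max_bytes


original = "Zoe"  # 3 characters, 3 bytes in UTF-8
masked = "Zoë"    # 3 characters, but 4 bytes in UTF-8

print(fits_byte_column(original, 3))  # prints: True
print(fits_byte_column(masked, 3))    # prints: False
```

A check like this could be run against sample masked output before choosing an algorithm for such columns.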

MS SQL Connector

Supported versions

Microsoft SQL Server 2019

Supported data types

The following are the different data types that are tested in our lab setup:

VARCHAR
CHAR
DATETIME
INT
TEXT
VARBINARY (only unload/load)
SMALLINT
SMALLMONEY
MONEY
BIGINT
NVARCHAR
TINYINT
NUMERIC(X,Y)
DECIMAL(X,Y)
FLOAT
NCHAR
BIT
NTEXT
Structured data types:
- XML
- JSON

Property Values

Property	Value
`SKIP.LOAD.SPLIT.COUNT.VALIDATION`	`false`
`SKIP.UNLOAD.SPLIT.COUNT.VALIDATION`	`false`

For default values, see Configuration settings.

Known Limitations

If the masked data produced by the applied algorithm exceeds the maximum value range of the target column's data type, the job fails in the load service.
Schema, table, and column names containing special characters are not supported.
Masking of columns with the VARBINARY data type is not supported.
Hyperscale Compliance can mask a maximum of 1000 tables in a single job.
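
To illustrate the first limitation, the sketch below checks whether a masked integer still fits the value range of the target column type. The ranges are the standard SQL Server integer ranges; the helper name `fits_column` is hypothetical and the check is not part of the load service.

```python
# Value ranges of the SQL Server integer types listed above.
INT_RANGES = {
    "TINYINT": (0, 255),
    "SMALLINT": (-32768, 32767),
    "INT": (-2147483648, 2147483647),
    "BIGINT": (-9223372036854775808, 9223372036854775807),
}


def fits_column(value: int, sql_type: str) -> bool:
    """Return True if value is within the range of the given integer type."""
    low, high = INT_RANGES[sql_type.upper()]
    return low <= value <= high


print(fits_column(255, "TINYINT"))     # prints: True
print(fits_column(40000, "SMALLINT"))  # prints: False
```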

Delimited files connector

The connector can be used to mask large delimited files. The delimited unload service splits the large files into smaller chunks and passes them on to the masking service. After the masking is completed, the files are sent to the load service, which joins the split files back together (you can also disable the join operation).

For the delimited files connector, the splitting and joining of files is handled by a backend tool called the “Data Writer”. From release 17.0.0 onwards, you can choose the type of Data Writer based on your needs and on the limitations of each type. The supported data writers are:

“pyarrow”: Apache Arrow is used by the connector to split/join files for the mounted filesystem target location.
“pyspark”: Apache Spark is used by the delimited-unload-service to split files. The delimited-load-service uses the Linux ‘cat’ command to join the masked split files for a mounted filesystem target location, and the “pyspark” writer for an AWS S3 target location.
“cat”: Applicable only to the delimited-load-service with a mounted filesystem target location. It uses the Linux cat command to join the masked split files.
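
The split and join flow described above can be sketched in plain Python. This is a simplified illustration and not how any of the data writers is implemented: it assumes records end with \n, \r, or \r\n and that no field contains an embedded line break. The function names are hypothetical.

```python
from typing import List


def split_records(data: bytes, parts: int) -> List[bytes]:
    """Split data into at most `parts` chunks, cutting only between records."""
    records = data.splitlines(keepends=True)
    size = max(1, -(-len(records) // parts))  # ceiling division
    return [b"".join(records[i:i + size]) for i in range(0, len(records), size)]


def join_chunks(chunks: List[bytes]) -> bytes:
    """Concatenate chunks in order, as the Linux cat command would."""
    return b"".join(chunks)


data = b"1,Ann\n2,Bob\n3,Cy\n4,Dee\n5,Eve\n"
chunks = split_records(data, 2)
print(len(chunks))                  # prints: 2
print(join_chunks(chunks) == data)  # prints: True
```

In the real flow each chunk would be masked between the split and the join; the sketch only shows that splitting on record boundaries and concatenating in order reproduces the file layout.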

Prerequisites

The source and target (NFS) locations must be mounted onto the Docker containers of the unload and load services. Note that the locations on the containers are the paths to use when creating the connector-info objects using the controller.

CODE

# As an example
unload-service:
     image: delphix-delimited-unload-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/nfs/mounted/source1/files:/mnt/source1
          - /path/to/nfs/mounted/source2/files:/mnt/source2
...
load-service:
     image: delphix-delimited-load-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/nfs/mounted/target1/files:/mnt/target1
          - /path/to/nfs/mounted/target2/files:/mnt/target2
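
Because the connector-info must reference container paths rather than host paths, a small helper can translate one into the other from the volume entries. This is a sketch under the assumption that each volume entry has the form host_dir:container_dir; the function name `to_container_path` and the sample file name are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable


def to_container_path(host_path: str, volumes: Iterable[str]) -> str:
    """Map a host path to its location inside the container."""
    host = PurePosixPath(host_path)
    for spec in volumes:
        host_dir, container_dir = spec.split(":")[:2]
        try:
            relative = host.relative_to(host_dir)
        except ValueError:
            continue  # host_path is not under this mount
        return str(PurePosixPath(container_dir) / relative)
    raise ValueError(f"{host_path} is not under any mounted volume")


volumes = [
    "/path/to/nfs/mounted/source1/files:/mnt/source1",
    "/path/to/nfs/mounted/source2/files:/mnt/source2",
]
print(to_container_path("/path/to/nfs/mounted/source1/files/customers.csv", volumes))
# prints: /mnt/source1/customers.csv
```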

Set the required data writer using the DATA_WRITER_TYPE environment variable.

CODE

unload-service:
     image: delphix-delimited-unload-service-app:<HYPERSCALE VERSION> 
     ...
     environment:
          ...
          - DATA_WRITER_TYPE=pyspark
...
load-service:
     image: delphix-delimited-load-service-app:<HYPERSCALE VERSION> 
     ...
     environment:
          ...
          - DATA_WRITER_TYPE=pyspark

Property values

Property	Value
SOURCE_KEY_FIELD_NAMES	unique_source_files_identifier
LOAD_SERVICE_REQUIREPOSTLOAD	false
DATA_WRITER_TYPE	“pyarrow” (default for delimited-unload-service), “pyspark”, or “cat” (default for, and only applicable to, delimited-load-service)
UNLOAD_SPARK_DRIVER_MEMORY	90% of available memory
UNLOAD_SPARK_DRIVER_CORES	90% of available cores

For default values, see Configuration settings.
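
The DATA_WRITER_TYPE rules above can be summarized in a short validation sketch. The sets of allowed values per service are read from the descriptions in this section and should be treated as an assumption, as should the function name `resolve_writer`; the services perform their own validation.

```python
from typing import Mapping

# Assumed from this section: "cat" applies only to the load service.
ALLOWED = {
    "unload": {"pyarrow", "pyspark"},
    "load": {"pyarrow", "pyspark", "cat"},
}
DEFAULTS = {"unload": "pyarrow", "load": "cat"}


def resolve_writer(service: str, env: Mapping[str, str]) -> str:
    """Return the data writer a service would use, or raise if it is invalid."""
    writer = env.get("DATA_WRITER_TYPE", DEFAULTS[service])
    if writer not in ALLOWED[service]:
        raise ValueError(f"{writer!r} is not valid for the {service} service")
    return writer


print(resolve_writer("unload", {}))                             # prints: pyarrow
print(resolve_writer("load", {"DATA_WRITER_TYPE": "pyspark"}))  # prints: pyspark
```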

Supported data types

The following data types are supported by the delimited files Hyperscale connector:

String/Text
Double
Int64
Timestamp

Known limitations

Only single-character ASCII delimiters are supported.
The end-of-record character can only be \n, \r, or \r\n.
Limitations with PyArrow Data Writer:
1. Output files always enclose all string values in double quotes (`"`).
2. Columns with double data types will be converted to strings. For example, 36377974237282886994505 will be converted to "36377974237282886994505".
3. Columns with int64 data type will be converted to strings. For example, 00009435304391722556805 will be converted to "00009435304391722556805".
Limitation with PySpark Data Writer:
1. PySpark is more memory intensive, so processing data that is larger than the available memory can lead to resource exhaustion. Caution: The size of the split files multiplied by the number of cores must not exceed the system memory.
2. With PyArrow as the data writer, the split files are generated one after the other, so the masking service is called as soon as each split is created. With PySpark as the data writer, the split files become available only after the whole split process is complete, so the masking service is called only after all splits are created. As a result, a Hyperscale masking execution takes longer overall with PySpark than with PyArrow.
3. The number of splits created may be less than the requested number. This generally happens when the file is small and Spark does not create as many partitions as the requested split count.
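
The PySpark memory caution above is simple arithmetic, sketched below. The function name and the sample sizes are hypothetical; the rule itself (split size multiplied by the number of cores must not exceed system memory) comes from the caution.

```python
GIB = 1024 ** 3


def splits_fit_in_memory(split_size_bytes: int, cores: int, system_memory_bytes: int) -> bool:
    """Apply the rule: split size x number of cores <= system memory."""
    return split_size_bytes * cores <= system_memory_bytes


# 2 GiB splits on 16 cores need 32 GiB, which fits in 64 GiB.
print(splits_fit_in_memory(2 * GIB, 16, 64 * GIB))  # prints: True
# 8 GiB splits on 16 cores need 128 GiB, which does not fit.
print(splits_fit_in_memory(8 * GIB, 16, 64 * GIB))  # prints: False
```

When the check fails, the options are presumably smaller splits (a higher split count) or fewer cores for the Spark driver.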

MongoDB connector

The connector can be used to mask large MongoDB collections. The Mongo unload service splits the large collections into smaller chunks and passes them on to the masking service. After the masking is completed, the files are sent to the Mongo load service, which imports the masked files into the target collection.

Supported versions

Platforms	Version
Linux	MongoDB 4.4.x, MongoDB 5.0.x, MongoDB 6.0.x

Roles and privileges

MongoDB users should have the following roles and privileges:

Topology of Database	Source Database User Privileges	Target Database User Privileges
Sharded Replica Set	role: clusterMonitor, db: admin; role: read, db: <source database>	role: clusterAdmin, db: admin; role: readWrite, db: <target database>
Non-Sharded Replica Set	role: clusterMonitor, db: admin; role: read, db: <source database>	role: clusterMonitor, db: admin; role: readWrite, db: <target database>
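
The role requirements above can be expressed as data. The sketch below returns them in the role and db document shape that MongoDB's createUser command accepts in its roles array. The function name `required_roles` and the topology and side labels are hypothetical.

```python
from typing import Dict, List


def required_roles(topology: str, side: str, database: str) -> List[Dict[str, str]]:
    """Return the roles for a source or target user on the given topology.

    topology: "sharded" or "non-sharded"; side: "source" or "target".
    """
    if topology not in ("sharded", "non-sharded") or side not in ("source", "target"):
        raise ValueError("unknown topology or side")
    if side == "source":
        return [
            {"role": "clusterMonitor", "db": "admin"},
            {"role": "read", "db": database},
        ]
    # Target users need a cluster-level role on admin plus readWrite on the database.
    cluster_role = "clusterAdmin" if topology == "sharded" else "clusterMonitor"
    return [
        {"role": cluster_role, "db": "admin"},
        {"role": "readWrite", "db": database},
    ]


print(required_roles("sharded", "target", "sales"))
# prints: [{'role': 'clusterAdmin', 'db': 'admin'}, {'role': 'readWrite', 'db': 'sales'}]
```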

Prerequisites

Use the Mongo unload and Mongo load service image names under unload-service and load-service. The NFS location must be mounted onto the Docker containers for the unload and load services. The following example mounts /mnt/hyperscale.

CODE

# As an example docker-compose.yaml
unload-service:
     image: delphix-mongo-unload-service-app:${VERSION}
     volumes:
          # Uncomment below lines to mount respective paths.
          - /mnt/hyperscale:/etc/hyperscale

load-service:
     image: delphix-mongo-load-service-app:${VERSION}
     volumes:
          # Uncomment below lines to mount respective paths.
          - /mnt/hyperscale:/etc/hyperscale

Uncomment the following lines in the docker-compose.yaml file under controller > environment:

CODE

# uncomment below for MongoDB connector
#- SOURCE_KEY_FIELD_NAMES=database_name,collection_name    
#- VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS=${VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS:-false}
#- VALIDATE_MASKED_ROW_COUNT_FOR_STATUS=${VALIDATE_MASKED_ROW_COUNT_FOR_STATUS:-false}
#- VALIDATE_LOAD_ROW_COUNT_FOR_STATUS=${VALIDATE_LOAD_ROW_COUNT_FOR_STATUS:-false}
#- DISPLAY_BYTES_INFO_IN_STATUS=${DISPLAY_BYTES_INFO_IN_STATUS:-true}
#- DISPLAY_ROW_COUNT_IN_STATUS=${DISPLAY_ROW_COUNT_IN_STATUS:-false}

Set LOAD_SERVICE_REQUIRE_POST_LOAD=false in the “.env” file.

CODE

# Set LOAD_SERVICE_REQUIRE_POST_LOAD=false for MongoDB Connector
LOAD_SERVICE_REQUIRE_POST_LOAD=false

Uncomment the following lines in the “.env” file.

CODE

# Uncomment below for MongoDB Connector
#VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS=false
#VALIDATE_MASKED_ROW_COUNT_FOR_STATUS=false
#VALIDATE_LOAD_ROW_COUNT_FOR_STATUS=false
#DISPLAY_BYTES_INFO_IN_STATUS=true
#DISPLAY_ROW_COUNT_IN_STATUS=false

Property values

Mandatory changes are required for the MongoDB Connector in the docker-compose.yaml and .env files:

Property	Value
SOURCE_KEY_FIELD_NAMES	database_name,collection_name
LOAD_SERVICE_REQUIRE_POST_LOAD	false
VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS	false
VALIDATE_MASKED_ROW_COUNT_FOR_STATUS	false
VALIDATE_LOAD_ROW_COUNT_FOR_STATUS	false
DISPLAY_BYTES_INFO_IN_STATUS	true
DISPLAY_ROW_COUNT_IN_STATUS	false

For default values, see Configuration settings.
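
Since several values must be changed by hand, a quick check of the .env contents can catch lines that were left commented out. The sketch below is an illustration only: the function names are hypothetical, and it covers the .env properties listed above (SOURCE_KEY_FIELD_NAMES is set in docker-compose.yaml and is not checked here).

```python
from typing import Dict, Optional

REQUIRED_ENV = {
    "LOAD_SERVICE_REQUIRE_POST_LOAD": "false",
    "VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS": "false",
    "VALIDATE_MASKED_ROW_COUNT_FOR_STATUS": "false",
    "VALIDATE_LOAD_ROW_COUNT_FOR_STATUS": "false",
    "DISPLAY_BYTES_INFO_IN_STATUS": "true",
    "DISPLAY_ROW_COUNT_IN_STATUS": "false",
}


def parse_env(text: str) -> Dict[str, str]:
    """Parse NAME=value lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        values[name.strip()] = value.strip()
    return values


def find_mismatches(text: str) -> Dict[str, Optional[str]]:
    """Return each required name whose actual value differs (None if unset)."""
    actual = parse_env(text)
    return {
        name: actual.get(name)
        for name, expected in REQUIRED_ENV.items()
        if actual.get(name) != expected
    }


sample = "LOAD_SERVICE_REQUIRE_POST_LOAD=false\n#DISPLAY_BYTES_INFO_IN_STATUS=true\n"
print(sorted(find_mismatches(sample)))
```

In this sample only LOAD_SERVICE_REQUIRE_POST_LOAD is set, so the other five names are reported as mismatches.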

Known limitations

In-Place Masking is not supported.
The MongoDB Hyperscale connector deployment on the Red Hat OpenShift Container Platform is not supported.

Parquet connector

The connector can be used to mask large Parquet files. The Parquet unload service splits the large files into smaller chunks and passes them on to the masking service. After the masking is completed, the files are sent to the load service, which joins the split files back together (you can also disable the join operation).

Prerequisites

Mounted filesystems are supported for both source and target locations. To use them, mount the source and target (NFS) locations onto the Docker containers of the unload and load services. Note the locations on the containers, because those are the paths to use when creating the connector-info using the controller.

CODE

# As an example
unload-service:
     image: delphix-parquet-unload-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/nfs/mounted/source1/files:/mnt/source1
          - /path/to/nfs/mounted/source2/files:/mnt/source2
...
load-service:
     image: delphix-parquet-load-service-app:<HYPERSCALE VERSION> 
     ...
     volumes:
          ...
          - /path/to/nfs/mounted/target1/files:/mnt/target1
          - /path/to/nfs/mounted/target2/files:/mnt/target2

The connector should be able to access the AWS S3 buckets (the source and target locations). The following approaches are supported by the connector and can be used to authenticate with the S3 bucket:

Attaching an IAM role to the EC2 instance where the Hyperscale masking services will be deployed.
- IAM roles are designed for applications to securely make AWS API requests from EC2 instances, without needing to manage the security credentials that the applications use.
- Using the AWS console UI or AWS CLI, attach the IAM role to the EC2 instance running the Hyperscale services. For more information, see the AWS documentation.
- With IAM role authentication, there is no need to pass the AWS credentials during the connector-info creation.
  CODE
```
# Example connector-info payload
{
  "source": {
    "type": "AWS",
    "properties": {
        "server": "S3",
        "path": "aws_s3_bucket/sub_folder(s)"
    }
  },
  "target": {
    "type": "AWS",
    "properties": {
        "server": "S3",
        "path": "aws_s3_bucket/sub_folder(s)"
    }
  }
}
```

Passing the AWS Access Key ID & AWS Secret Access Key attached to an AWS role:

Access keys are long-term credentials generated for an IAM user or role. These keys can be used for programmatic requests to the AWS CLI or AWS API (directly or using the AWS SDK). For more information, see the AWS documentation.

These credentials can be passed during the connector-info creation.

CODE

# Example connector-info payload
{
  "source": {
    "type": "AWS",
    "properties": {
        "server": "S3",
        "path": "aws_s3_bucket/sub_folder(s)",
        "aws_region": "us-west-2",
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY"
    }
  },
  "target": {
    "type": "AWS",
    "properties": {
        "server": "S3",
        "path": "aws_s3_bucket/sub_folder(s)",
        "aws_region": "us-west-2",
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY"
    }
  }
}

They can also be set as environment variables when bringing up the Parquet connector services.

CODE

unload-service:
    ...
    environment:
      - AWS_DEFAULT_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=<aws_access_key_id>
      - AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>

...
load-service:
    ...
    environment:
      - AWS_DEFAULT_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=<aws_access_key_id>
      - AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>
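
The two connector-info payload shapes shown above (with and without explicit credentials) can be built by one helper. This is a sketch: the property keys are taken from the example payloads, while the function names and the rule that both key values must be supplied together are assumptions. Real credentials should come from a secret store or the environment, not from source code.

```python
import json
from typing import Optional


def s3_endpoint(path: str, region: Optional[str] = None,
                access_key_id: Optional[str] = None,
                secret_access_key: Optional[str] = None) -> dict:
    """Build the source or target part of a connector-info payload."""
    if (access_key_id is None) != (secret_access_key is None):
        raise ValueError("provide both the access key ID and the secret access key")
    properties = {"server": "S3", "path": path}
    if region is not None:
        properties["aws_region"] = region
    if access_key_id is not None:
        properties["aws_access_key_id"] = access_key_id
        properties["aws_secret_access_key"] = secret_access_key
    return {"type": "AWS", "properties": properties}


def connector_info(source: dict, target: dict) -> dict:
    """Combine source and target endpoints into one payload."""
    return {"source": source, "target": target}


# IAM role authentication: no credentials in the payload.
payload = connector_info(s3_endpoint("aws_s3_bucket/source"),
                         s3_endpoint("aws_s3_bucket/target"))
print(json.dumps(payload, indent=2))
```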

Property values

Configurations on the controller service:

Property	Value
`SOURCE_KEY_FIELD_NAMES`	unique_source_files_identifier
`LOAD_SERVICE_REQUIREPOSTLOAD`	false

Configuration on the parquet-unload-service:

Property	Value
`MAX_WORKER_THREADS_PER_JOB`	512

For default values, see Configuration settings.

Supported data types

The following data types are supported by the Parquet files Hyperscale connector:

BOOLEAN
INT32
INT64
INT96
FLOAT
DOUBLE
BYTE_ARRAY

Known limitations

Parquet files are generally compressed, and the compression factor can vary from 2x to 70x or even more. When working with such files, the connector needs a host with enough memory to accommodate the parallel processing of multiple large Parquet files. If the sum of the uncompressed sizes of the Parquet files being processed in parallel exceeds 80% of the RAM size, an “out of memory” (OOM) error is likely. To avoid OOM errors, reduce MAX_WORKER_THREADS_PER_JOB (that is, the number of parallel threads), which reduces memory usage.
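
As a worked example of the 80% guideline, the sketch below estimates how many files of a given uncompressed size can be processed in parallel. It assumes that each worker thread handles one file at a time, which this section does not state explicitly; the function name and the sample sizes are hypothetical.

```python
GIB = 1024 ** 3


def max_parallel_files(ram_bytes: int, uncompressed_file_bytes: int,
                       budget_percent: int = 80) -> int:
    """Return how many files fit within the given share of RAM."""
    if uncompressed_file_bytes <= 0:
        raise ValueError("file size must be positive")
    budget = ram_bytes * budget_percent // 100
    return budget // uncompressed_file_bytes


# A 1 GiB Parquet file with a 10x compression factor is about 10 GiB uncompressed.
uncompressed = 1 * GIB * 10
print(max_parallel_files(64 * GIB, uncompressed))  # prints: 5
```

Under that assumption, a 64 GiB host processing such files would call for a MAX_WORKER_THREADS_PER_JOB of about 5 rather than 512.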