Data source support
Oracle connector
Oracle Database (commonly referred to as Oracle RDBMS or simply as Oracle) is a multi-model database management system produced and marketed by Oracle Corporation. The following table lists the versions that have been tested in the lab setup:
| Platforms | Version | 
|---|---|
| Linux | 
 | 
- User on source database must select privileges 
- User on target database side must have all privileges and SELECT_CATALOG_ROLE. 
Supported Data Types
The following are the different data types that are tested in our lab setup:
- VARCHAR 
- VARCHAR2 
- NUMBER 
- FLOAT 
- DATE 
- TIMESTAMP(default) 
- CLOB 
- BLOB(with text) 
- User Defined Types: - Collection (Nested table only) 
 
- Structured data types: - XML 
- JSON 
 
- Hyperscale Compliance restricts the support of the following special characters for a user defined type name: - ~!@#$%^&*()\\\"?:;,/\\\\`+=[]{}|<>'-.\")]and also restricts collection of CLOB and BLOB in user defined type.
- Hyperscale Compliance restricts the support of the following special characters for a database column name: - ~!@#$%^&*()\\\"?:;,/\\\\`+=[]{}|<>'-.\")]
Property values
| Property | Value | 
|---|---|
| 
 | 
 | 
| 
 | 
 | 
For default values, see Configuration settings.
Known limitations
The length of the algorithm's generated masked data may exceed the target database table's column length resulting in a job failure if the target table columns use CHAR data type with BYTE length semantics to store the multibyte characters in the corresponding column. The workaround is to use an algorithm that should generate mask data with a smaller length.
MS SQL Connector
Supported versions
Microsoft SQL Server 2019
Supported data types
The following are the different data types that are tested in our lab setup:
- VARCHAR 
- CHAR 
- DATETIME 
- INT 
- TEXT 
- VARBINARY (only unload/load) 
- SMALLINT 
- SMALLMONEY 
- MONEY 
- BIGINT 
- NVARCHAR 
- TINYINT 
- NUMERIC(X,Y) 
- DECIMAL(X,Y) 
- FLOAT 
- NCHAR 
- BIT 
- NTEXT 
- MONEY 
- Structured data types: - XML 
- JSON 
 
Property Values
| Property | Value | 
|---|---|
| 
 | 
 | 
| 
 | 
 | 
For default values, see Configuration settings
Known Limitations
- If the applied algorithm's produced mask data exceeds the corresponding target table columns datatype's max value range, then job execution will fail in load service. 
- Schemas, tables, and column names having special characters are not supported. 
- Masking of columns with - VARBINARYdatatype is not supported.
- Hyperscale Compliance can mask up to a maximum 1000 tables in a single job. 
File connector
In earlier versions of Hyperscale Compliance,  Delimited and Parquet connectors were released as separate connectors. From the 23.0.0 version of Hyperscale Compliance onwards, Delimited and Parquet connectors are merged into a single file connector.
The connector can be used to mask large delimited and parquet files. The file connector’s unload service splits the large files into smaller chunks and passes them onto the masking service. After the masking is completed, the files are sent to the load service which joins back the split files (the end user also has a choice to disable the join operation).
Delphix supports Delimted and Parquet as file types for file connectors. The splitting and joining of files are handled by the writers embedded in the file connector. You can choose an appropriate writer for your operations by setting an environment variable DATA_WRITER_TYPE. Users can also skip the processing of source files in unload service by enabling the staging push feature, setting an environment variable SKIP_UNLOAD_WRITERS to true will enable the staging push at the unload service. Users can also facilitate staging push in load service by setting up the environment variable to SKIP_LOAD_WRITERS.
The supported data writers are:
- pyarrow 
- pyspark 
- cat 
You can refer to the below matrix when choosing the writer type.
| Source Type | File Type | DATA_WRITER_TYPE (Unload) | DATA_WRITER_TYPE (Load) | 
|---|---|---|---|
| FS | Delimited/Parquet | pyarrow(default)/pyspark | pyarrow/cat (default) | 
| AWS | Delimited/Parquet | pyarrow(default)/pyspark | pyspark(default) | 
| HADOOP-DB/HADOOP-FS | Delimited/Parquet | pyarrow* | pyarrow* | 
| HADOOP-DB/HADOOP-FS | Delimited/Parquet | pyspark | pyspark | 
*If the source is Hadoop and the user is willing to use pyarrow as a writer, they must provide the Hadoop client as a volume mount.
- If Parquet file type contains complex data types, then use pyspark for both services. 
- If Delimited file type has a target location in an S3 bucket, then use pyspark for both services. 
- For Hadoop, ensure that you use same writer for both the services. 
Prerequisites
- The source and target (NFS) locations have to be mounted onto the docker containers of unload and load service. Please note that the locations on the containers are what needs to be used when creating the connector-info’s using the controller. CODE- # As an example unload-service: image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> ... volumes: ... - /path/to/nfs/mounted/source1/files:/mnt/source1 - /path/to/nfs/mounted/source2/files:/mnt/source2 ... load-service: image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> ... volumes: ... - /path/to/nfs/mounted/target1/files:/mnt/target1 - /path/to/nfs/mounted/target2/files:/mnt/target2
- Set the required data writer using the DATA_WRITER_TYPE environment variable. CODE- unload-service: image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> ... volumes: ... - DATA_WRITER_TYPE=pyspark ... load-service: image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> ... environment: ... - DATA_WRITER_TYPE=pyspark
- The connector should be able to access the AWS S3 buckets (the source and target locations). The following approaches are supported by the connector and can be used to authenticate with the S3 bucket: - Attaching the IAM role to the EC2 instance where the hyperscale masking services will be deployed. - IAM Roles are designed for applications to securely make AWS-API requests from EC2 instances, without the necessity to manage the security credentials that the applications use. - Using the AWS console UI or AWS CLI, attach the IAM role to the EC2 instance running the Hyperscale services. To know more, check the AWS Documentation. 
- With IAM role authentication, there is no need to pass the AWS credentials during the connector-info creation. CODE- # Example connector-info payload { "source": { "type": "AWS", "properties": { "server": "S3", "path": "aws_s3_bucket/sub_folder(s)" } }, "target": { "type": "AWS", "properties": { "server": "S3", "path": "aws_s3_bucket/sub_folder(s)" } } }
 
 
- Passing the AWS Access Key ID & AWS Secret Access Key attached to an AWS role: - Access keys are long-term credentials generated for an IAM user or role. These keys can be for programmatic requests to the AWS CLI or AWS API (directly or using the AWS SDK). To know more, check the AWS Documentation. 
- These credentials can be passed during the connector-info creation. CODE- # Example connector-info payload { "source": { "type": "AWS", "properties": { "server": "S3", "path": "aws_s3_bucket/sub_folder(s)", "aws_region": "us-west-2", "aws_access_key_id": "AWS_ACCESS_KEY_ID", "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY" } }, "target": { "type": "AWS", "properties": { "server": "S3", "path": "aws_s3_bucket/sub_folder(s)", "aws_region": "us-west-2", "aws_access_key_id": "AWS_ACCESS_KEY_ID", "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY" } } }
- They can also be set as environment variables when bringing up the connector services. CODE- unload-service: ... environment: - AWS_DEFAULT_REGION=us-east-1 - AWS_ACCESS_KEY_ID=<aws_access_key_id> - AWS_SECRET_ACCESS_KEY=<aws_secret_access_key> ... load-service: ... environment: - AWS_DEFAULT_REGION=us-east-1 - AWS_ACCESS_KEY_ID=<aws_access_key_id> - AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>
 
 
- Set the required data writer using the DATA_WRITER_TYPE environment variable. CODE- unload-service: image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> ... volumes: ... - DATA_WRITER_TYPE=pyspark ... load-service: image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> ... environment: ... - DATA_WRITER_TYPE=pyspark
- Set SKIP_UNLOAD_WRITERS and SKIP_LOAD_WRITERS environment variables to true to enable staging push. CODE- unload-service: image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> ... volumes: ... - SKIP_UNLOAD_WRITERS=true ... load-service: image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> ... environment: ... - SKIP_LOAD_WRITERS=true
- In case the source is HADOOP-DB, the connector requires Kerberos authentication to access the Hadoop cluster. Users must provide a principal name and keytab file that grant access to both the Hive database and the underlying HDFS. In case source is HADOOP-FS, the Kerberos principal name is necessary to retrieve files from the specified HDFS path. In both cases, the target location is always an HDFS file system. Below are the pre-requisites for using Hadoop as a connector type: - The user has to volume mount the - core-site.xml,- hadoop.keytaband- krb5.confin unload and load services. Below is an example of the same.CODE- # As an example unload-service: image: delphix-file-connector-unload-service-app:<HYPERSCALE VERSION> ... volumes: ... - /path/to/hadoop/keytab/hdfs.keytab:/app/hadoop.keytab - /path/to/hadoop/krb5.conf:/etc/krb5.conf - /path/to/hadoop/core-site.xml:/app/hadoop/etc/hadoop/core-site.xml ... load-service: image: delphix-file-connector-load-service-app:<HYPERSCALE VERSION> ... volumes: ... - /path/to/hadoop/keytab/hdfs.keytab:/app/hadoop.keytab - /path/to/hadoop/krb5.conf:/etc/krb5.conf - /path/to/hadoop/core-site.xml:/app/hadoop/etc/hadoop/core-site.xml
- In case the user is using DATA_WRITER_TYPE as pyarrow then they also have to provide Hadoop client location as volume mount in addition to above volume mounts in unload and load services. Below is example of same. CODE- - /path/to/hadoop-client/hadoop:/app/hadoop
 
- Below are a few connector info examples with hadoop as source and target: - The source and target are HADOOP-FS CODE- { "connectorName": "Hadoop_Hdfs_connector", "source": { "type": "HADOOP-FS", "properties": { "server": "<Hadoop server hostname>", "port": "<HDFS port>", "path": "Source file location", "principal_name": "<Kerberos principal name>", "protocol": "hdfs" } }, "target": { "type": "HADOOP-FS", "properties": { "server": "<Hadoop server hostname>", "path": "<target file location in HDFS>", "protocol": "hdfs", "port": "<HDFS port>", "principal_name": "<Kerberos principal name>" } } }
- The source is HADOOP-DB and target is HADOOP-FS CODE- { "connectorName": "Hadoop_Hive_connector", "source": { "type": "HADOOP-DB", "properties": { "server": "<Hadoop server hostname>", "database_name": "<database name where the tables are present>", "database_type": "hive", "database_port": "<Thrift SSL port for Apache hive>", "principal_name": "<Kerberos principal name>", "protocol": "hdfs" } }, "target": { "type": "HADOOP-FS", "properties": { "server": "<Hadoop server hostname>", "path": "<target file location in HDFS>", "protocol": "hdfs", "port": "<HDFS port>", "principal_name": "<Kerberos principal name>" } } }
 
Property values
| Property | Value | 
|---|---|
| SOURCE_KEY_FIELD_NAMES | unique_source_files_identifier | 
| LOAD_SERVICE_REQUIREPOSTLOAD | false | 
| DATA_WRITER_TYPE | 
 | 
| UNLOAD_SPARK_DRIVER_MEMORY | 90% of available memory | 
| UNLOAD_SPARK_DRIVER_CORES | 90% of available cores | 
| MAX_WORKER_THREADS_PER_JOB | 512 | 
| SKIP_UNLOAD_WRITERS | false | 
| SKIP_LOAD_WRITERS | false | 
For default values, see Configuration settings.
Supported Compression types for Parquet files
- SNAPPY 
- GZIP 
- BROTLI 
- ZSTD 
- LZ4 
- UNCOMPRESSED 
Supported data types
The following are the supported data types for delimited files:
- String/Text 
- Double 
- Int64 
- Timestamp 
The following are the supported data types for parquet files :
- BOOLEAN 
- INT32 
- INT64 
- INT96 
- FLOAT 
- DOUBLE 
- BYTE_ARRAY 
Known limitations
Delimited File types
- Supports only Single-character ASCII delimiters 
- The end-of-record character can only be \n, \r, or \r\n. 
- Limitations with PyArrow Data Writer: - Output files will exclusively enclose all string types with double quotes ( - “).
 Col umns with double data types will be converted to strings. For example, 6377974237282886994505 will be converted to “36377974237282886994505”.
- Columns with int64 data type will be converted to strings. For example, 0009435304391722556805 will be converted to “00009435304391722556805". 
 
- Limitation with PySpark Data Writer: - PySpark is more memory intensive, so in case we are processing data that is more in size in comparison to the available memory then we may run into issues related to resource exhaustion. Caution: The size of split files multiplied by the number of cores must not exceed the system memory. 
- With PyAarrow as the data writer, the split files are generated one after the other, so the masking-service is called as and when a split is created. With PySpark as the data writer, all split files are available only after the split process is complete. So the masking service will be only called after all splits are completed. Due to this, the overall time taken to complete the hyperscale masking execution will be more compared to the former. 
- There is a possibility that the number of splits created in the end will be less than the requested number, this generally happens when the file size is small, and spark doesn’t create as many partitions as the requested split number. 
 
Parquet file types
- Generally, the parquet files are compressed and the compression factor could vary from 2x to 70x or even more. So, when working with such larger files the connector will need a host which has large enough memory to accommodate the parallel execution of multiple large parquet files. In case the sum of the uncompressed size of parquet files that are getting executed in parallel exceeds 80% of RAM size then the chances of having an “out of memory” error are high. To avoid OOM, the end user can reduce the MAX_WORKER_THREADS_PER_JOB (i.e. reduce the number of parallel threads), ultimately reducing the memory usage. 
- Struct and list type values are treated as strings for the Delphix Continuous Compliance Engine, therefore, you can not add individual elements of any struct and list to the masking inventory property of the dataset payload. 
MongoDB connector
The connector can be used to mask large MongoDB files. The Mongo unload service splits the large collections into smaller chunks and passes them onto the masking service. After the masking is completed, the files are sent to the Mongo load service, which imports the masked files into the target collection.
Supported versions
| Platforms | Version | 
|---|---|
| Linux | MongoDB 4.4.x MongoDB 5.0.x MongoDB 6.0.x MongoDB 7.0.x MongoDB 8.0.x | 
Roles and privileges
MongoDB users should have the following roles and privileges:
| Topology of Database | Source Database User Privileges | Target Database User | ||
| Default | drop_collection : No | Default | drop_collection : No | |
| Sharded Replica Set | role: clusterMonitor db: admin | role: read db: <source database> | role: clusterAdmin db: admin” | role: readWrite db: target database | 
| role: read db: <source database> | role: readWrite db: <target database> | |||
| Non Sharded Replica Set | role: clusterMonitor db: admin | role: read db: <source database> | role: clusterMonitor db: admin | role: readWrite db: <target database> | 
| role: read, db: <source database> | role: readWrite db: <target database> | |||
Prerequisites
- Mongo Unload and Mongo Load service image names are to be used under unload-service and load-service. The NFS location has to be mounted onto the Docker containers for unload and load services. Example for mounting - /mnt/hyperscale.CODE- # As an example docker-compose.yaml unload-service: image: delphix-mongo-unload-service-app:${VERSION} volumes: # Uncomment below lines to mount respective paths. - /mnt/hyperscale:/etc/hyperscale load-service: image: delphix-mongo-load-service-app:${VERSION} volumes: # Uncomment below lines to mount respective paths. - /mnt/hyperscale:/etc/hyperscale
- Uncomment the below lines from - docker-compose.yamlfile under- controller > environment:
# uncomment below for MongoDB connector
#- SOURCE_KEY_FIELD_NAMES=database_name,collection_name    
#- VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS=${VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS:-false}
#- VALIDATE_MASKED_ROW_COUNT_FOR_STATUS=${VALIDATE_MASKED_ROW_COUNT_FOR_STATUS:-false}
#- VALIDATE_LOAD_ROW_COUNT_FOR_STATUS=${VALIDATE_LOAD_ROW_COUNT_FOR_STATUS:-false}
#- DISPLAY_BYTES_INFO_IN_STATUS=${DISPLAY_BYTES_INFO_IN_STATUS:-true}
#- DISPLAY_ROW_COUNT_IN_STATUS=${DISPLAY_ROW_COUNT_IN_STATUS:-false}- Set the value of - LOAD_SERVICE_REQUIRE_POST_LOAD=falseinside the “- .env”- # Set LOAD_SERVICE_REQUIRE_POST_LOAD=false for MongoDB Connector LOAD_SERVICE_REQUIRE_POST_LOAD=false
- Uncomment the below lines from “ - .env” file.CODE- # Uncomment below for MongoDB Connector #VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS=false #VALIDATE_MASKED_ROW_COUNT_FOR_STATUS=false #VALIDATE_LOAD_ROW_COUNT_FOR_STATUS=false #DISPLAY_BYTES_INFO_IN_STATUS=true #DISPLAY_ROW_COUNT_IN_STATUS=false
- To leverage Reduced Privilege Operations, you must set the - drop_collectionproperty to No. After this property is set, the connector will no longer require- clusterAdminand- clusterMonitorprivileges. The following are the implications of setting- drop_collectionproperty to No:- The connector will not validate - clusterMonitorprivilege at the source and- clusterMonitorand- clusterAdminprivileges at the target. For more information, refer to the Roles and Privileges table above.
- The connector will skip the following operations on a target collection: - shard collection 
- create shard key 
- create index 
 
 
You will be responsible for executing the above operations on a target collection.
Property values
Mandatory changes are required for the MongoDB Connector in the docker-compose.yaml and .env files:
| Property | Value | 
|---|---|
| SOURCE_KEY_FIELD_NAMES | database_name,collection_name | 
| LOAD_SERVICE_REQUIRE_POST_LOAD | false | 
| VALIDATE_UNLOAD_ROW_COUNT_FOR_STATUS | false | 
| VALIDATE_MASKED_ROW_COUNT_FOR_STATUS | false | 
| VALIDATE_LOAD_ROW_COUNT_FOR_STATUS | false | 
| DISPLAY_BYTES_INFO_IN_STATUS | true | 
| DISPLAY_ROW_COUNT_IN_STATUS | false | 
For default values, see Configuration settings.
Known limitation:
- In-Place Masking is not supported. 
Snowflake connector
The connector can be used to mask table residing under Snowflake platform. The Snowflake unload service splits data exported from source tables into smaller chunks (CSV files) and passes them onto the masking service. After the masking is completed, the files are sent to the Snowflake load service, which imports the masked files into the target database table.
Supported versions
The Snowflake platform is version-agnostic. We do not need to manage or specify snowflake database versions, as updates and maintenance are handled automatically by Snowflake.
Prerequisites
- For accessing Snowflake warehouse, you will need a service account user with key-pair authentication. (Key-pair authentication provides enhanced security). 
a. To configure key-pair authentication for a Snowflake service account, refer to the instructions provided in Snowflake’s configuring key-pair authentication documentation.
b. Use the following commands to generate base64 encoded value of encrypted private key and passphrase.
To generate base64 encoded value of encrypted private key:
echo -n `cat /home/delphix/keys/MKK_TEST_SFHSC_ADMIN_key.p8|base64`To generate base64 encode value of passphrase:
echo -n <passphrase>|base64 -w 0- Configure Snowflake storage integration to read data from and write data to an Amazon S3 bucket referenced in an external (i.e. S3) stage 
a. To configure the Snowflake storage integration, refer to the instructions provided in Snowflake’s configuring a Snowflake storage integration to access Amazon S3 documentation. 
- Configure Snowflake external stage (i.e. S3) where the data files are staged. 
a. To configure the Snowflake storage integration, refer to the instructions provided in Snowflake’s creating an S3 stage documentation.
Roles and privileges
- The table below lists all the privileges required to unload and load data from/to Snowflake platform. 
- The sample names (e.g., DLPX_WAREHOUSE, DLPX_SNOW_DB, DLPX_SNOWHSC, etc.) are for example and explanation purposes only. 
- Replace these names with your actual warehouse, database, schema, and role names as per your Snowflake environment. 
Source Snowflake database
| Object type | Object name | Privilege | Granted to role | 
|---|---|---|---|
| Warehouse | DLPX_WAREHOUSE | USAGE, OPERATE | DLPX_SNOWHSC | 
| Database | DLPX_SNOW_DB | USAGE | DLPX_SNOWHSC | 
| Schema | DLPX_SNOW_SCHEMA | USAGE | DLPX_SNOWHSC | 
| All Tables | DLPX_SNOW_DB.DLPX_SNOW_SCHEMA | SELECT | DLPX_SNOWHSC | 
| External Stage | DLPX_SNOW_DB.DLPX_SNOW_SCHEMA.DLPX_S3_EXT_STAGE | USAGE | DLPX_SNOWHSC | 
Target Snowflake database
| Object type | Object name | Privilege | Granted to role | 
|---|---|---|---|
| Warehouse | DLPX_WAREHOUSE | USAGE, OPERATE | DLPX_SNOWHSC | 
| Database | DLPX_SNOW_DB | USAGE | DLPX_SNOWHSC | 
| Schema | DLPX_SNOW_SCHEMA | USAGE, MODIFY, CREATE TABLE | DLPX_SNOWHSC | 
| All Tables | DLPX_SNOW_DB.DLPX_SNOW_SCHEMA | SELECT, INSERT, UPDATE, DELETE | DLPX_SNOWHSC | 
| External Stage | DLPX_SNOW_DB.DLPX_SNOW_SCHEMA.DLPX_S3_EXT_STAGE | USAGE | DLPX_SNOWHSC | 
Warehouse privileges: USAGE allows the role to use the warehouse and OPERATE enables suspending and resuming it.
Database privileges: USAGE allows the role to see the database and use objects within.
Schema privileges: Additional permissions such as CREATE TABLE and MODIFY are granted for schema-level management.
Table privileges: SELECT, INSERT, UPDATE, DELETE permissions for all existing tables ensure role-based access for managing data in the schema.
Stage privileges: USAGE is necessary for unloading data to and loading data from the external stage.
Property values
Snowflake Hyperscale Connector currently supports only AWS S3 as staging area.
Mandatory changes are required for the Snowflake Connector in the values.yaml file:
| Property | Value | 
|---|---|
| stagingStorageType | AWS_S3 | 
| applicationName | <staging-area-name> | 
| authMechanism | AWS_ROLE | 
| awsBucketName | <aws-s3-bucket-name> | 
| awsBucketRegion | <s3-region-name> | 
| awsBucketDelimiter | / | 
For instructions on how to configure AWS S3 as staging area, see configuring-aws-s3-bucket-as-staging-area
Mandatory changes are required for the Snowflake connector in the values-snowflake.yaml file:
| Property | Value | 
|---|---|
| sourceKeyFieldNames | database_name,schema_name,table_name,stage_name | 
| loadServiceRequirepostload | false | 
| validateUnloadRowCountForStatus | false | 
| validateMaskedRowCountForStatus | false | 
| validateLoadRowCountForStatus | false | 
| displayBytesInfoInStatus | true | 
| displayRowCountInStatus | true | 
| snowflakePrivateKey | <SNOWFLAKE_PRIVATE_KEY> | 
| snowflakePassphrase | <SNOWFLAKE_PASSPHRASE> | 
| unload:  | delphix-snowflake-unload-service-app | 
| load:  | delphix-snowflake-load-service-app | 
For default values, see Configuration settings.
Known limitation
In-Place Masking is not supported.
