How to set up a Hyperscale Compliance job
Pre-checks
You must check the following before starting a job:
Storage space must be twice the size of the source data for the staging area.
You must have sufficient storage in the target DB for loading the masked data.
You must check and increase the size of the temporary tablespace in Oracle. For example, if you have 4 billion rows, then you must use 100G.
You must check and provide the required permissions (i.e., 770 or 700) on the mounted folder on the Hyperscale Compliance host after creating an empty VDB (or mounting an NFS share).
Based on the umask value for the user that is used to mount, the permissions for the staging area directory could get altered after the empty VDB or NFS share has been mounted. In such cases, you must re-apply the permissions (i.e., 770 or 700) on the staging area directory.
You must restart the services after changing the permissions on the VDB mounted folder if you have already created the containers.
The Continuous Compliance Engine should be cleaned up before use and should only be used with Hyperscale jobs. Any masking job on the Continuous Compliance Engine other than those run by the Hyperscale Compliance Orchestrator will impact the performance of Hyperscale Compliance jobs.
Currently, the Hyperscale Compliance Orchestrator does not provide the ability to configure the masking job behavior in case of non-conformant data and does not process non-conformant data warnings from the Delphix Continuous Compliance Engine. Therefore, it is recommended to verify the value of the DefaultNonConformantDataHandling algorithm group setting on all the Continuous Compliance Engines. For more information, refer to the Algorithm Group Settings section. It is recommended to set the value to FAIL so that the Hyperscale job will also fail instead of leaving the data unmasked.
If the table that you are masking has a column of type BLOB/CLOB, then you must have a minimum of 2 GB of memory per CLOB/BLOB column. Depending on the unload-split you are using, you may need to increase this memory by a multiple of that value. For example, if you have 4 tables (each with 1 column of type BLOB/CLOB) and the unload-split is 3, then the memory requirement on the Hyperscale Compliance host will be:
(4 tables x 2 GB per CLOB/BLOB column x 3 unload-split) + 16 GB (minimum required memory for running the Hyperscale Compliance Orchestrator) = 40 GB approx.
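As a quick check, the same estimate can be expressed in a few lines of Python; the figures below are simply the example values above.
tables_with_lob_columns = 4        # tables that each contain one CLOB/BLOB column
memory_per_lob_column_gb = 2       # minimum memory per CLOB/BLOB column
unload_split = 3                   # unload-split used for these tables
orchestrator_base_gb = 16          # minimum memory for the Hyperscale Compliance Orchestrator

required_gb = (tables_with_lob_columns * memory_per_lob_column_gb * unload_split) + orchestrator_base_gb
print(f"Approximate memory requirement: {required_gb} GB")   # prints 40 GB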
API Flow to Set Up a Hyperscale Compliance Job
The following is the API flow for setting up and executing a Hyperscale Compliance job.
Register Continuous Compliance Engine(s)
Create Connector Info
Upload format file using the POST /file-upload endpoint. For more information, refer to https://delphixdocs.atlassian.net/wiki/spaces/~5ceab98e6f51c10fbd18ae6d/pages/230229360. [Required only for those who need to mask embedded XML/JSON data]
Create Structured Data Format [Required only for those who need to mask embedded XML/JSON data]
Create a Dataset
Create a Job
Create Execution
The following are sample API requests/responses for a typical Hyperscale Compliance job execution workflow. The APIs can be accessed using a swagger-based API client at the following URL: https://<hyperscale-compliance-host-address>/hyperscale-compliance.
APIs must be called only in the order shown below.
Engines API
POST /engines (register an engine):
Request:
{
"name": "Delphix Continuous Compliance Engine 6.0.14.0 on AWS",
"type": "MASKING",
"protocol": "http",
"hostname": "de-6014-continuous-compliance.delphix.com",
"username": "hyperscale_compliance_user",
"password": "password123"
}
Response:
{
"id": 1,
"name": "Delphix Continuous Compliance Engine 6.0.14.0 on AWS",
"type": "MASKING",
"protocol": "http",
"hostname": "de-6014-continuous-compliance.delphix.com",
"username": "hyperscale_compliance_user",
"ssl": true,
"ssl_hostname_check": true
}
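The same registration call can be scripted. The following is a minimal sketch using Python's requests library; the base URL, Authorization header, and verify setting are placeholders for your deployment's values (they are assumptions, not part of the API contract).
import requests

BASE_URL = "https://<hyperscale-compliance-host-address>/api"   # adjust to your deployment's API base path
HEADERS = {"Authorization": "<api-key>"}                        # supply the key/scheme issued by your deployment

engine = {
    "name": "Delphix Continuous Compliance Engine 6.0.14.0 on AWS",
    "type": "MASKING",
    "protocol": "http",
    "hostname": "de-6014-continuous-compliance.delphix.com",
    "username": "hyperscale_compliance_user",
    "password": "password123",
}

# verify=False only if your deployment uses a self-signed certificate
response = requests.post(f"{BASE_URL}/engines", json=engine, headers=HEADERS, verify=False)
response.raise_for_status()
engine_id = response.json()["id"]   # referenced later in masking_engine_ids when creating the job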
ConnectorInfo API
POST /connector-info (create connector info for Hyperscale Compliance)
Oracle Request:
{
"source": {
"jdbc_url": "jdbc:oracle:thin:@oracle-19-src.dlpxdc.co:1521/VDBOMSRDC20SRC",
"user": "oracle_db_user",
"password": "password123"
},
"target": {
"jdbc_url": "jdbc:oracle:thin:@rh79-ora-19-tgt.dlpxdc.co:1521/VDBOMSRDC200B_TGT",
"user": "oracle_db_user",
"password": "password123"
}
}
Oracle Response:
{
"id": 1,
"source": {
"jdbc_url": "jdbc:oracle:thin:@oracle-19-src.dlpxdc.co:1521/VDBOMSRDC20SRC",
"user": "oracle_db_user"
},
"target": {
"jdbc_url": "jdbc:oracle:thin:@rh79-ora-19-tgt.dlpxdc.co:1521/VDBOMSRDC200B_TGT",
"user": "oracle_db_user"
}
}
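Continuing the same sketch, the connector info can be created with the Oracle payload shown above (BASE_URL and HEADERS are the placeholders defined in the engine registration example).
connector_info = {
    "source": {
        "jdbc_url": "jdbc:oracle:thin:@oracle-19-src.dlpxdc.co:1521/VDBOMSRDC20SRC",
        "user": "oracle_db_user",
        "password": "password123",
    },
    "target": {
        "jdbc_url": "jdbc:oracle:thin:@rh79-ora-19-tgt.dlpxdc.co:1521/VDBOMSRDC200B_TGT",
        "user": "oracle_db_user",
        "password": "password123",
    },
}

response = requests.post(f"{BASE_URL}/connector-info", json=connector_info, headers=HEADERS, verify=False)
response.raise_for_status()
connector_id = response.json()["id"]   # referenced as connector_id when creating the dataset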
Example 2: This example applies when either the username or password needs to be in uppercase or camel case.
Oracle Request:
{
"source": {
"user": "\"y2ijf0oj2\"",
"password": "\"xyz\"",
"jdbc_url": "jdbc:oracle:thin:@xyz.com:1521/DBOMSRDC200B",
"connection_properties": {}
},
"target": {
"jdbc_url": "jdbc:oracle:thin:@xyx.com:1521/DBOMSRDC200B",
"user": "\"y2ijf0oj2\"",
"password": "\"xyz\"",
"connection_properties": {}
}
}
Oracle Response:
{
"id": 1,
"source": {
"user": "\"y2ijf0oj2\"",
"jdbc_url": "jdbc:oracle:thin:@xyz.com:1521/DBOMSRDC200B",
"connection_properties": {}
},
"target": {
"jdbc_url": "jdbc:oracle:thin:@xyx.com:1521/DBOMSRDC200B",
"user": "\"y2ijf0oj2\"",
"connection_properties": {}
}
}
MSSQL Request:
{
"source": {
"jdbc_url": "jdbc:sqlserver://hyperscale-mssql.dlpxdc.co;database=SourceDB2019;instanceName=SQL2019",
"user": "sa",
"password": "password123"
},
"target": {
"jdbc_url": "jdbc:sqlserver://hyperscale-mssql.dlpxdc.co;database=SourceDB2019;instanceName=SQL2019;",
"user": "sa",
"password": "password123"
}
}
MSSQL Response:
{
"id": 1,
"source": {
"user": "sa",
"jdbc_url": "jdbc:sqlserver://hyperscale-mssql.dlpxdc.co;database=SourceDB2019;instanceName=SQL2019"
},
"target": {
"jdbc_url": "jdbc:sqlserver://hyperscale-mssql.dlpxdc.co;database=SourceDB2019;instanceName=SQL2019;",
"user": "sa",
}
}
MongoDB Connector Request:
{
"connectorName": "MongoDB_Connector",
"source": {
"mongo_url": "mongodb://<hostname>:<port>/?authSource=<authSource>[&<other_options>]",
"user": "mongo_user",
"password": "mongo_password"
},
"target": {
"mongo_url": "mongodb://<hostname>:<port>/?authSource=<authSource>[&replicaSet=<mongo_rs>&tls=true&tlsCertificateKeyFile=<cert_path>]",
"user": "mongo_user",
"password": "mongo_password"
}
}
MongoDB Connector Response:
{
"source": {
"mongo_url": "mongodb://mongodb-src.example:27017",
"user": "mongo_user",
},
"target": {
"mongo_url": "mongodb://mongodb-tgt.example.com:27017",
"user": "mongo_user",
}
}
A failure in the load may leave the target datasource in an inconsistent state because the load step truncates the target when it begins. If the source and target are configured to be the same datasource and a failure occurs in the load step, it is recommended to restore that datasource from a backup (or use the Continuous Data Engine's rewind feature if the datasource is a VDB), as it may be in an inconsistent state. After the datasource is restored, you may kick off another Hyperscale job. If the source and target are configured to be different datasources, you may use the Hyperscale Compliance Orchestrator restartability feature to restart the job from the point of failure in the load/post-load step.
Delimited/Parquet Request:
Example 1 - Without AWS S3 bucket credentials:
{
"source": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_source/sub_folder"
}
},
"target": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_target/sub_folder"
}
}
}
Example 2 - With AWS S3 bucket credentials:
{
"source": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_source/sub_folder",
"aws_region": "us-east-1",
"aws_access_key_id": "AKIAYHUJKLDHMB",
"aws_secret_access_key": "x2IXoHDYHhdydmmm&h12563kaka",
"aws_role_arn": "56436882398"
}
},
"target": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_target/sub_folder",
"aws_region": "us-east-1",
"aws_access_key_id": "AKIAYHUJKLDHMB",
"aws_secret_access_key": "x2IXoHDYHhdydmmm&h12563kaka",
"aws_role_arn": "56436882398"
}
}
}
Overview of the parameters:
type: Refers to the type of connector you are creating. Delphix currently supports “AWS” (cloud vendor), indicating that all the source and target files are accessed through AWS credentials.
properties: Holds the server and path values.
server: Points to the S3 location. Currently, this is ignored as ‘AWS’ is the only supported type.
path: The path to the AWS source/target S3 bucket location. This need not be the exact path to the files; a parent directory is sufficient. However, if you plan to use a profiler, this path must be the exact path to the parquet files.
Additional parameters (in case AWS credentials need to be passed separately):
aws_region: The AWS region the S3 bucket is part of.
aws_access_key_id: The AWS access key ID generated for the AWS Role.
aws_secret_access_key: The AWS secret access key generated for the AWS Role.
aws_role_arn: The AWS Role Identifier
Example 3 - With Mounted Filesystem:
{
"id": 1,
"connectorName": "Parquet_Connector_FS",
"source": {
"type": "FS",
"properties": {
"server": "local",
"path": "/mnt/source"
}
},
"target": {
"type": "FS",
"properties": {
"server": "local",
"path": "/mnt/target"
}
}
}
Here `type=FS` means “File System”. Currently, the Parquet connector only supports files mounted directly onto the Docker container, i.e., available on the container file system. In the above example, `/mnt/source` and `/mnt/target` are paths inside the container that denote the source and target locations, respectively.
In case of staging push, the source and target locations must share a mount point with the staging area. The user needs to specify the volume path as the parent directory. For instance, if the parent path of the staging area is /etc/hyperscale, then the source and target paths would be /etc/hyperscale/source and /etc/hyperscale/target. The user should place their source files in /etc/hyperscale/source, and they may optionally include format files following the naming convention <unique_source_identifier>.fmt. The output files will be available in /etc/hyperscale/target.
Overview of the parameters:
type: Refers to the type of connector you are creating. In the above example, the type is “FS”, which refers to Filesystem.
properties: Holds the server and path values.
server: Currently, for “FS” only “local” is allowed as the server.
path: The path for the source and target locations.
Example 4 - With HADOOP-FS using Kerberos:
{
"connectorName": "Hadoop_Hdfs_connector",
"source": {
"type": "HADOOP-FS",
"properties": {
"server": "hadoop-server-hostname.co",
"path": "/path/to/source/hdfs/files",
"protocol": "hdfs",
"port": 8020,
"principal_name": "principal_name/somedomain@REALM.COM"
}
},
"target": {
"type": "HADOOP-FS",
"properties": {
"server": "hadoop-server-hostname.co",
"path": "/path/to/target/hdfs/files",
"protocol": "hdfs",
"port": 8020,
"principal_name": "principal_name/somedomain@REALM.COM"
}
}
}
Overview of the parameters:
type: Refers to the type of connector you are creating. In the above example, we are using the type as “HADOOP-FS” which refers to “HDFS”, indicating that all the source and target files are available within the HDFS.
properties: Holds the server and path values.
server: Points to the Hadoop server.
path: The path to the HDFS source/target location.
Additional parameters:
principal_name: Required for Kerberos authentication; the principal name must have access assigned in the user-provided keytab entry.
protocol: Set this protocol as per your Hadoop configuration.
port: The HDFS port; by default it is 8020.
Example 5 - With HADOOP-DB using Kerberos:
{
"connectorName": "Hadoop_Hive_connector",
"source": {
"type": "HADOOP-DB",
"properties": {
"server": "hadoop-server-hostname.co",
"database_name": "hive",
"database_type": "hive",
"database_port": "10000",
"principal_name": "principal_name/somedomain@REALM.COM",
"protocol": "hdfs"
}
},
"target": {
"type": "HADOOP-FS",
"properties": {
"server": "hadoop-server-hostname.co",
"path": "/path/to/hdfs/files",
"protocol": "hdfs",
"port": 8020,
"principal_name": "principal_name/somedomain@REALM.COM"
}
}
}
Overview of the parameters:
type: Refers to the type of connector you are creating. In the above example, we are using the type as “HADOOP-DB” which refers to “HDFS via Hive database” for source and “HADOOP-FS” for target.
properties: Holds the server and path values.
server: Points to the Hadoop server.
path: The path to the HDFS source/target location.
Additional parameters:
principal_name: Required for Kerberos authentication; the principal name must have access assigned in the user-provided keytab entry.
database_type: Currently, Hive is the only supported database.
database_name: The Hive database name.
database_port: The Thrift port for accessing the database.
protocol: Set this protocol as per your Hadoop configuration.
port: The HDFS port; by default it is 8020.
File connector response examples:
Example 1
{
"source": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_source/sub_folder"
}
},
"target": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_target/sub_folder"
}
}
}
Example 2
{
"source": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_source/sub_folder",
"aws_region": "us-east-1",
"aws_access_key_id": "AKIA********",
"aws_secret_access_key": "x2IX********",
"aws_role_arn": "56436882398"
}
},
"target": {
"type": "AWS",
"properties": {
"server": "S3",
"path": "s3_bucket_target/sub_folder",
"aws_region": "us-east-1",
"aws_access_key_id": "AKIA********",
"aws_secret_access_key": "x2IX********",
"aws_role_arn": "56436882398"
}
}
}
Example 3
{
"id": 1,
"connectorName": "Parquet_Connector_FS",
"source": {
"type": "FS",
"properties": {
"server": "local",
"path": "/mnt/source"
}
},
"target": {
"type": "FS",
"properties": {
"server": "local",
"path": "/mnt/target"
}
}
}
Example 4
{
"connectorName": "Hadoop_Hdfs_connector_2",
"source": {
"type": "HADOOP-FS",
"properties": {
"server": "hadoop-server-hostname.co",
"port": "8020",
"path": "/source/file/location",
"principal_name": "PRINCIPAL_NAME@DOMAIN",
"protocol": "hdfs"
}
},
"target": {
"type": "HADOOP-FS",
"properties": {
"server": "hadoop-server-hostname.co",
"path": "/target/file/location",
"protocol": "hdfs",
"port": "8020",
"principal_name": "PRINCIPAL_NAME@DOMAIN"
}
}
}
Example 5
{
"connectorName": "Hadoop_Hive_connector_2",
"source": {
"type": "HADOOP-DB",
"properties": {
"server": "hadoop-server-hostname.co",,
"database_name": "default",
"database_type": "hive",
"database_port": "10000",
"principal_name": "PRINCIPAL_NAME@DOMAIN",
"protocol": "hdfs"
}
},
"target": {
"type": "HADOOP-FS",
"properties": {
"server": "hadoop-server-hostname.co",,
"path": "/target/file/location",
"protocol": "hdfs",
"port": "8020",
"principal_name": "PRINCIPAL_NAME@DOMAIN"
}
}
}
StructuredDataFormat APIs
This functionality is specifically needed when there is a requirement to mask XML/JSON data contained within database columns. These endpoints require a file_upload_ref
that can be generated via the POST /file-upload
endpoint wherein you must upload the format of the XML/JSON data to be masked. For more information, refer to the Hyperscale Compliance API documentation.
POST /structured-data-format (create structured-data-format for a column)
Request:
{
"file_upload_ref": "delphix-file://upload/f_XXXX/XXX.xml",
"doc_type": "XML",
"masking_inventory_paths": [
{
"path": "/catalog/book/author",
"domain_name": "NULL_SL",
"algorithm_name": "dlpx-core:FullName"
},
{
"path": "/catalog/book/isbn",
"domain_name": "ADDRESS",
"algorithm_name": "dlpx-core:CM Alpha-Numeric"
}
]
}
Response:
{
"structured_data_format_id":1,
"file_upload_ref": "delphix-file://upload/f_XXXX/XXX.xml",
"doc_type": "XML",
"masking_inventory_paths": [
{
"path": "/catalog/book/author",
"domain_name": "NULL_SL",
"algorithm_name": "dlpx-core:FullName"
},
{
"path": "/catalog/book/isbn",
"domain_name": "ADDRESS",
"algorithm_name": "dlpx-core:CM Alpha-Numeric"
}
]
}
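As a rough sketch of the two calls involved, continuing the Python example above. The multipart form field name and the shape of the file-upload response are assumptions; confirm both against the Hyperscale Compliance API documentation.
# Upload the XML/JSON format file, then create the structured-data-format from the returned reference.
with open("catalog_format.xml", "rb") as format_file:             # hypothetical local format file
    upload = requests.post(f"{BASE_URL}/file-upload",
                           files={"file": format_file},           # form field name assumed
                           headers=HEADERS, verify=False)
upload.raise_for_status()
file_upload_ref = upload.json()["fileReferenceId"]                # response field name assumed; the value
                                                                  # looks like "delphix-file://upload/f_XXXX/XXX.xml"

structured_data_format = {
    "file_upload_ref": file_upload_ref,
    "doc_type": "XML",
    "masking_inventory_paths": [
        {"path": "/catalog/book/author", "domain_name": "NULL_SL",
         "algorithm_name": "dlpx-core:FullName"},
    ],
}
response = requests.post(f"{BASE_URL}/structured-data-format",
                         json=structured_data_format, headers=HEADERS, verify=False)
response.raise_for_status()
structured_data_format_id = response.json()["structured_data_format_id"]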
DataSets API
Table and schema names are case-sensitive.
For the MSSQL connector, it is recommended to provide filter_key if the unload_split count is more than 1 and the table does not contain a primary/unique key; otherwise data will be unloaded through a sequential approach, which may be slower. If filter_key is not provided and the table includes a primary/unique key, then Hyperscale will scan the table to fetch the key automatically.
The dataset date format should be the same as the environment variable date format value.
Within one dataset, all the inventories should use the same date format.
POST /data-sets (create dataset for Hyperscale Compliance)
Alternatively, you can create or update the dataset using a payload in a file with the below endpoints:
POST /data-sets/file-upload
PUT /data-sets/file-upload/{dataSetId}
The above endpoints require a file_upload_ref
that can be generated via the POST /file-upload
endpoint. For more information, refer to the Hyperscale Compliance API documentation.
Request (with single table):
{
"connector_id": 1,
"data_info": [
{
"source": {
"schema_name": "SCHEMA_1",
"table_name": "TABLE_1",
"unload_split": 4
},
"target": {
"schema_name": "SCHEMA_1_TARGET",
"table_name": "TABLE_1_TARGET",
"stream_size": 65536
},
"masking_inventory": [
{
"field_name": "FIRST_NAME",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "LAST_NAME",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
},
{
"field_name": "XmlData",
"structured_data_format_id": 1
}
]
}
]
}
Response (with single table):
{
"id": 1,
"connector_id": 1,
"data_info": [
{
"source": {
"schema_name": "SCHEMA_1",
"table_name": "TABLE_1",
"unload_split": 4
},
"target": {
"schema_name": "SCHEMA_1",
"table_name": "TABLE_1",
"stream_size": 65536
},
"masking_inventory": [
{
"field_name": "FIRST_NAME",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "LAST_NAME",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
},
{
"field_name": "XmlData",
"structured_data_format_id": 1
}
]
}
]
}
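As a sketch, the single-table dataset above can also be created programmatically, reusing the connector_id (and, where relevant, structured_data_format_id) from the previous steps.
data_set = {
    "connector_id": connector_id,
    "data_info": [
        {
            "source": {"schema_name": "SCHEMA_1", "table_name": "TABLE_1", "unload_split": 4},
            "target": {"schema_name": "SCHEMA_1_TARGET", "table_name": "TABLE_1_TARGET",
                       "stream_size": 65536},
            "masking_inventory": [
                {"field_name": "FIRST_NAME", "domain_name": "FIRST_NAME",
                 "algorithm_name": "FirstNameLookup"},
                {"field_name": "LAST_NAME", "domain_name": "LAST_NAME",
                 "algorithm_name": "LastNameLookup"},
            ],
        }
    ],
}
response = requests.post(f"{BASE_URL}/data-sets", json=data_set, headers=HEADERS, verify=False)
response.raise_for_status()
data_set_id = response.json()["id"]   # referenced as data_set_id when creating the job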
Request (with single table & filter_key in the source):
{
"connector_id": 1,
"data_info": [
{
"source": {
"schema_name": "SCHEMA_1",
"table_name": "TABLE_1",
"unload_split": 4,
"filter_key": "PKID"
},
"target": {
"schema_name": "SCHEMA_1_TARGET",
"table_name": "TABLE_1_TARGET",
"stream_size": 65536
},
"masking_inventory": [
{
"field_name": "FIRST_NAME",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "LAST_NAME",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
},
{
"field_name": "XmlData",
"structured_data_format_id": 1
}
]
}
]
}
Response (with single table & filter_key in the source):
{
"id": 1,
"connector_id": 1,
"data_info": [
{
"source": {
"schema_name": "SCHEMA_1",
"table_name": "TABLE_1",
"unload_split": 4,
"filter_key": "PKID"
},
"target": {
"schema_name": "SCHEMA_1",
"table_name": "TABLE_1",
"stream_size": 65536
},
"masking_inventory": [
{
"field_name": "FIRST_NAME",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "LAST_NAME",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
},
{
"field_name": "XmlData",
"structured_data_format_id": 1
}
]
}
]
}
Request (with multiple tables):
{
"connector_id": 1,
"data_info": [
{
"source": {
"unload_split": 2,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_0"
},
"target": {
"stream_size": 65536,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_0"
},
"masking_inventory": [
{
"field_name": "col_VARCHAR",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "XmlData",
"structured_data_format_id": 1
}
]
},
{
"source": {
"unload_split": 2,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_1"
},
"target": {
"stream_size": 65536,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_1"
},
"masking_inventory": [
{
"field_name": "COL_TIMESTAMP",
"domain_name": "DOB",
"algorithm_name": "DateShiftVariable",
"date_format": "yyyy-MM-dd HH:mm:ss.SSS" -->(optional field, this needs to be added only while working with date/time masking)
}
]
}
]
}
Response (with multiple tables):
{
"id": 1,
"connector_id": 1,
"data_info": [
{
"source": {
"unload_split": 2,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_0"
},
"target": {
"stream_size": 65536,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_0"
},
"masking_inventory": [
{
"field_name": "col_VARCHAR",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "XmlData",
"structured_data_format_id": 1
}
]
},
{
"source": {
"unload_split": 2,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_1"
},
"target": {
"stream_size": 65536,
"schema_name": "DLPXDBORA",
"table_name": "test_multi_1"
},
"masking_inventory": [
{
"field_name": "COL_TIMESTAMP",
"domain_name": "DOB",
"algorithm_name": "DateShiftVariable",
"date_format": "yyyy-MM-dd HH:mm:ss.SSS"
}
]
}
]
}
File Connector with Delimited Files DataSet request:
{
"connector_id": 1,
"data_info": [
{
"source": {
"delimiter": "|",
"endOfRecord": "\n",
"enclosure": "\"",
"enclosureEscapeCharacter": "\\",
"escapeEnclosureEscapeCharacter": false,
"unique_source_files_identifier": "file_identifier1",
"has_headers": false,
"unload_split": 100,
"source_files": [
"file1.txt",
"file2.txt"
]
},
"target": {
"perform_join": true
},
"masking_inventory": [
{
"field_name": "f3",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "f5",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
}
]
}
]
}
In case of staging push, the split count must always be 1. The source files should not have headers (or columns).
The source DataSet consists of the following parameters:
delimiter: The single-character delimiter used in the source files.
endOfRecord: The end-of-record (end of line) character. Currently only `\n`, `\r`, and `\r\n` are supported.
enclosure: The single-character quote character used in the source files.
enclosureEscapeCharacter: The escape character used to escape quote characters.
unique_source_files_identifier: The source key that maps the load-service and masking-service data sets with the unload-service data set. Ensure that this value is different for each item in the DataSet data_info list.
has_headers: A flag that indicates whether the character source files have header column names or not.
If set to true, format files with the same column names are created and can be used for the masking inventory.
If set to false, column names of the pattern f0, f1, f2, and so on are used to create the format files for delimited file masking. When adding the masking inventory, make sure to use field_name values f0, f1, f2, and so on.
unload_split: The number of splits that the files in the source_files list have to be split into. Ensure that the split number is neither too small nor too big for better overall performance.
source_files: List of all source files that need to be masked and adhere to the following rules:
All files should have the same delimiter character and other helper characters.
All files should have the same number of columns and the same column names if they have a header line.
Supported input formats:
Relative path to the file - relative to the configured connector source path. Examples (considering that the source path within the container is /mnt/source):
If the absolute path of the file was `/mnt/source/file1.txt`, then the source_files list should only contain `file1.txt`.
If the absolute path of the file was `/mnt/source/some_dir/file1.txt`, then the source_files list should contain the value `some_dir/file1.txt`.
Blob pattern - the path to all files matching a blob. Examples:
If we want to mask files `/mnt/source/file1.txt` and `/mnt/source/file2.txt`, then the source_files list can contain the value `file*.txt`.
If we want to mask all files within a directory, we can add `directory/*` to the source_files list.
The target DataSet consists of the following parameters:
perform_join: A flag to check if the load-service joins back all masked files or not.
If set to true, the load-service will join back masked files and keep the same relative file location structure in the target path (considering that the target path within the container is /mnt/target).
If the input was `file1.txt`, then the output file will be `/mnt/target/file1.txt`.
If the input was `some_dir/file1.txt`, then the output file will be `/mnt/target/some_dir/file1.txt`.
If set to false, the split masked files will be placed in the same relative file location structure in the target path with a split number appended to the file name.
If the input was `file1.txt`, then the output files will be `/mnt/target/file1_0.txt`, `/mnt/target/file1_1.txt`, and so on.
If the input was `some_dir/file1.txt`, then the output files will be `/mnt/target/some_dir/file1_0.txt`, `/mnt/target/some_dir/file1_1.txt`, and so on.
The masking inventory remains the same, except when the delimited files have no column names; in that case, the connector assumes the column names are `f0`, `f1`, `f2`, and so on, and these field names must be used to fill up the masking inventory (as shown in the example above).
File Connector Delimited Files DataSet response:
{
"id": 1,
"connector_id": 1,
"data_info": [
{
"source": {
"delimiter": "|",
"endOfRecord": "\n",
"enclosure": "\"",
"enclosureEscapeCharacter": "\\",
"escapeEnclosureEscapeCharacter": false,
"unique_source_files_identifier": "file_identifier1",
"has_headers": false,
"unload_split": 100,
"source_files": [
"file1.txt",
"file2.txt"
]
},
"target": {
"perform_join": true
},
"masking_inventory": [
{
"field_name": "f3",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "f5",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
}
]
}
]
}
File Connector with Parquet DataSet request:
{
"connector_id": 1,
"data_info": [
{
"source": {
"unique_source_files_identifier": "file_identifier1",
"unload_split": 100,
"file_type": "parquet",
"source_files": [
"relative_folder1/file1.parquet",
"relative_folder2/file2.parquet"
]
},
"target": {
"perform_join": true
},
"masking_inventory": [
{
"field_name": "colume_name1",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "colume_name2",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
}
]
}
]
}
The Hyperscale Parquet Profiler can be used to analyze the source parquet files and help generate the DataSet with the masking inventory. To know more, visit the Parquet Profiler documentation.
The source dataset consists of the following parameters:
unique_source_files_identifier: This is the source key that maps the load-service and masking-service data sets with the unload-service data set. Please ensure that this value is different for each item in the DataSet data_info list.
unload_split: The number of splits that the files in the source_files list have to be split into. Please ensure that the split number is not too small nor too big for better overall performance.
file_type: The file type should be set to “parquet”.
source_files: List of all source files that need to be masked and adhere to the following rules:
All files should have the same delimiter character and other helper characters.
All files should have the same number of columns and the same column names if they have a header line.
Supported input formats:
Relative path to the file, relative to the configured connector source path. Examples (considering that the source path within the container is s3_bucket/source):
If the file's absolute path was s3_bucket/source/file1.parquet, then the source_files list should only contain file1.parquet.
If the absolute path of the file was s3_bucket/source/some_dir/file1.parquet, then the source_files list should contain the value some_dir/file1.parquet.
Blob pattern, the path to all files matching a blob. Examples:
If we want to mask files s3_bucket/source/file1.parquet and s3_bucket/source/file2.parquet, then the source_files list can contain the value file*.parquet.
If we want to mask all files within a directory, we can add directory/* to the source_files list.
The target dataset consists of the following parameters:
perform_join: A flag to check if the load-service joins back all masked files or not.
If set to true, the load-service will join back masked files and keep the same relative file location structure in the target path (considering that the target path within the container is s3_bucket/target).
If the input was file1.parquet, then the output file will be s3_bucket/target/file1.parquet.
If the input was some_dir/file1.parquet, then the output file will be s3_bucket/target/some_dir/file1.parquet.
If set to false, the split masked files will be placed in the same relative file location structure in the target path with a split number appended to the file name.
If the input was file1.parquet, then the output files will be s3_bucket/target/file1_0.parquet, s3_bucket/target/file1_1.parquet, and so on.
If the input was some_dir/file1.parquet, then the output files will be s3_bucket/target/some_dir/file1_0.parquet, s3_bucket/target/some_dir/file1_1.parquet, and so on.
File Connector with Parquet DataSet response:
{
"connector_id": 1,
"data_info": [
{
"source": {
"unique_source_files_identifier": "file_identifier1",
"unload_split": 100,
"file_type": "parquet",
"source_files": [
"<relative_path>/file1.parquet",
"<relative_path>/file2.parquet"
]
},
"target": {
"perform_join": true
},
"masking_inventory": [
{
"field_name": "colume_name1",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "colume_name2",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
}
]
}
]
}
File Connector with Hive tables DataSet request:
{
"connector_id": 1,
"data_info": [
{
"source": {
"delimiter": "|",
"endOfRecord": "\n",
"enclosure": "\"",
"enclosureEscapeCharacter": "\\",
"escapeEnclosureEscapeCharacter": false,
"unique_source_files_identifier": "file_identifier1",
"has_headers": false,
"unload_split": 100,
"source_files": [
"table1",
"table2:part1"
]
},
"target": {
"perform_join": true
},
"masking_inventory": [
{
"field_name": "f3",
"domain_name": "FIRST_NAME",
"algorithm_name": "FirstNameLookup"
},
{
"field_name": "f5",
"domain_name": "LAST_NAME",
"algorithm_name": "LastNameLookup"
}
]
}
]
}
The source dataset consists of the following parameters:
unique_source_files_identifier: This is the source key that maps the load-service and masking-service data sets with the unload-service data set. Please ensure that this value is different for each item in the DataSet data_info list.
unload_split: The number of splits that the files in the source_files list have to be split into. Please ensure that the split number is not too small nor too big for better overall performance.
source_files: List of all tables with or without partitions.
MongoDB DataSet request:
{
"connector_id": 1,
"data_info": [
{
"source": {
"database_name": "SRC_DB",
"collection_name": "SRC_COLLECTION",
"unload_split": 10
},
"target": {
"database_name": "TGT_DB",
"collection_name": "TGT_COLLECTION",
"drop_collection" : "No" --> (Optional. If set to No, target collection will not be dropped before loading)
},
"masking_inventory": [
{
"field_name": "$[*]['relationships'][*]['person']['first_name']",
"domain_name": "FIRST_NAME",
"algorithm_name": "dlpx-core:FullName"
},
{
"field_name": "$[*]['address']",
"domain_name": "ADDRESS",
"algorithm_name": "dlpx-core:CM Alpha-Numeric"
}
]
}
]
}
unload_split: The number of splits that the MongoDB collection has to be split into. Make sure that the split number is neither too small nor too big for better overall performance.
drop_collection: Optional. To leverage Reduced Privilege Operations, you must set the drop_collection property to No. After the property is set, the connector will no longer require the clusterAdmin and clusterMonitor privileges. Check the prerequisites section for more details.
masking_inventory: The masking inventory for a MongoDB collection can be generated by using the MongoDB Profiler service, which will call the dataset API with the generated masking_inventory and create the dataset. The MongoDB Profiler artifact includes a README file that provides detailed usage instructions.
MongoDB DataSet response:
{
"connector_id": 1,
"data_info": [
{
"source": {
"database_name": "SRC_DB",
"collection_name": "SRC_COLLECTION",
"unload_split": 10
},
"target": {
"database_name": "TGT_DB",
"collection_name": "TGT_COLLECTION"
},
"masking_inventory": [
{
"field_name": "$[*]['relationships'][*]['person']['first_name']",
"domain_name": "FIRST_NAME",
"algorithm_name": "dlpx-core:FullName"
},
{
"field_name": "$[*]['address']",
"domain_name": "ADDRESS",
"algorithm_name": "dlpx-core:CM Alpha-Numeric"
}
]
}
]
}
Algorithm and Domain names provided in the DataSet request should be taken from the Continuous Compliance Engine. The Continuous Compliance Engine APIs that can be used to get these names are:
Get all algorithms (GET /algorithms) for Algorithm Names. Sample endpoint: https://maskingdocs.delphix.com/maskingApiEndpoints/5_1_15_maskingApiEndpoints.html#getAllAlgorithms
Get all domains (GET /domains) for Domain Names. Sample endpoint: https://maskingdocs.delphix.com/maskingApiEndpoints/5_1_15_maskingApiEndpoints.html#getAllDomains
For the extra parameters that need to be provided in the DataSet request for Date and Multi Column Algorithms, refer to Model DataSet_masking_inventory on the Hyperscale Compliance API Documentation page, available in the API Reference section of this documentation.
Alternatively, you can create/update the dataset using a payload in a file with the below endpoints:
POST /data-sets/file-upload
PUT /data-sets/file-upload/{dataSetId}
The above endpoints require a file_upload_ref, which can be generated via the POST /file-upload endpoint. See the Hyperscale Compliance API documentation.
Jobs API
POST /jobs (Create a Hyperscale Compliance job)
{
"name": "job_1",
"masking_engine_ids": [
1
],
"data_set_id": 1,
"app_name_prefix": "app",
"env_name_prefix": "env",
"consider_continuous_compliance_warning_event_as": "SUCCESS",
"retain_execution_data": "NO",
"source_configs": {
"max_concurrent_source_connection": 30
},
"target_configs": {
"max_concurrent_target_connection": 30,
"parallelism_degree": 15
},
"masking_job_config": {
"max_memory": 1024,
"min_memory": 0,
"description": "Job created by Hyperscale Masking",
"feedback_size": 100000,
"stream_row_limit": 10000,
"num_input_streams": 1
}
}
For more information on the retain_execution_data flag, see Cleaning Up Execution Data. In the case of Oracle, set parallelism_degree in target_configs to use the degree of parallelism while re-creating the indexes in the post-load step.
Set {"parallelism_degree": 0}: Use unmodified DDL provided by Oracle.
Set {"parallelism_degree": -1}: Remove any parallel or nonparallel clause from the DDL.
Set {"parallelism_degree": any positive value}: Remove the existing parallel degree or nonparallel clause and add Parallel <parallelism_degree> to the DDL.
Response:
The below properties are only applicable to Oracle data sources.
{
"id": 1,
"name": "Test_Job",
"masking_engine_ids": [
1,
2,
3
],
"data_set_id": 1,
"app_name_prefix": "Test_App",
"env_name_prefix": "Test_Env",
"consider_continuous_compliance_warning_event_as": "SUCCESS",
"retain_execution_data": "NO",
"source_configs": {
"max_concurrent_source_connection": 30
},
"target_configs": {
"max_concurrent_target_connection": 30
},
"masking_job_config": {
"max_memory": 1024,
"min_memory": 1024,
"description": "Job created by Hyperscale Masking",
"feedback_size": 100000,
"stream_row_limit": 10000,
"num_input_streams": 1
}
}
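As a sketch, the job can be created with the same Python approach, tying together the engine(s) and dataset created in the earlier steps.
job = {
    "name": "job_1",
    "masking_engine_ids": [engine_id],
    "data_set_id": data_set_id,
    "app_name_prefix": "app",
    "env_name_prefix": "env",
    "consider_continuous_compliance_warning_event_as": "SUCCESS",
    "retain_execution_data": "NO",
    "source_configs": {"max_concurrent_source_connection": 30},
    "target_configs": {"max_concurrent_target_connection": 30, "parallelism_degree": 15},
    "masking_job_config": {
        "max_memory": 1024,
        "min_memory": 0,
        "description": "Job created by Hyperscale Masking",
        "feedback_size": 100000,
        "stream_row_limit": 10000,
        "num_input_streams": 1,
    },
}
response = requests.post(f"{BASE_URL}/jobs", json=job, headers=HEADERS, verify=False)
response.raise_for_status()
job_id = response.json()["id"]   # referenced as job_id when creating an execution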
File Connector Job request:
In File Connector jobs, the source_configs and target_configs objects can be used to configure various job-specific parameters.
{
"name": "job_1",
"masking_engine_ids": [
1
],
"data_set_id": 1,
"app_name_prefix": "app",
"env_name_prefix": "env",
"consider_continuous_compliance_warning_event_as": "SUCCESS",
"retain_execution_data": "NO",
"masking_job_config": {
"max_memory": 1024,
"min_memory": 0,
"description": "Job created by Hyperscale Masking",
"feedback_size": 100000,
"stream_row_limit": 10000,
"num_input_streams": 1
},
"source_configs": {
"writer_type": "pyspark",
"unload_spark_driver_memory": "4g",
"unload_spark_driver_cores": 2,
"unload_spark_log_level": "INFO",
"max_worker_threads_per_job": 512,
"staging_push": false,
"user_provided_format_files": false,
"skip_pyspark_resource_validation": false
},
"target_configs": {
"writer_type": "pyspark",
"load_spark_driver_memory": "4g",
"load_spark_driver_cores": 2,
"load_spark_log_level": "INFO",
"staging_push": false
}
}
File Connector Job response:
{
"id": 1,
"name": "job_1",
"masking_engine_ids": [
1
],
"data_set_id": 1,
"app_name_prefix": "app",
"env_name_prefix": "env",
"consider_continuous_compliance_warning_event_as": "SUCCESS",
"retain_execution_data": "NO",
"masking_job_config": {
"max_memory": 1024,
"min_memory": 0,
"description": "Job created by Hyperscale Masking",
"feedback_size": 100000,
"stream_row_limit": 10000,
"num_input_streams": 1
},
"source_configs": {
"writer_type": "pyspark",
"unload_spark_driver_memory": "4g",
"unload_spark_driver_cores": 2,
"unload_spark_log_level": "INFO",
"max_worker_threads_per_job": 512,
"staging_push": false,
"user_provided_format_files": false,
"skip_pyspark_resource_validation": false
},
"target_configs": {
"writer_type": "pyspark",
"load_spark_driver_memory": "4g",
"load_spark_driver_cores": 2,
"load_spark_log_level": "INFO",
"staging_push": false
}
}
MongoDB Job request:
The MongoDB job does not require the source_configs and target_configs objects.
{
"name": "job_1",
"masking_engine_ids": [
1
],
"data_set_id": 1,
"app_name_prefix": "app",
"env_name_prefix": "env",
"consider_continuous_compliance_warning_event_as": "SUCCESS",
"retain_execution_data": "NO",
"masking_job_config": {
"max_memory": 1024,
"min_memory": 0,
"description": "Job created by MongoDB Hyperscale Masking",
"feedback_size": 100000,
"stream_row_limit": 10000,
"num_input_streams": 1
}
}
MongoDB Job response:
{
"name": "job_1",
"masking_engine_ids": [
1
],
"data_set_id": 1,
"app_name_prefix": "app",
"env_name_prefix": "env",
"consider_continuous_compliance_warning_event_as": "SUCCESS",
"retain_execution_data": "NO",
"masking_job_config": {
"max_memory": 1024,
"min_memory": 0,
"description": "Job created by MongoDB Hyperscale Masking",
"feedback_size": 100000,
"stream_row_limit": 10000,
"num_input_streams": 1
}
}
JobExecution API
POST /executions (Create an execution of a Hyperscale job)
Request:
{
"job_id": 1
}
Response: (The immediate response will look like the example below. The real-time status can be fetched using the GET /executions/{execution_id} endpoint.)
{
"id": 124,
"job_id": 38,
"status": "RUNNING",
"create_time": "2023-05-04T12:43:03.444964",
"tasks": [
{
"name": "Unload"
},
{
"name": "Masking"
},
{
"name": "Load"
},
{
"name": "Post Load"
}
]
}
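As a sketch, an execution can be started and then polled until it leaves the RUNNING state; the polling interval below is an arbitrary choice, not an API requirement.
import time

response = requests.post(f"{BASE_URL}/executions", json={"job_id": job_id},
                         headers=HEADERS, verify=False)
response.raise_for_status()
execution_id = response.json()["id"]

while True:
    execution = requests.get(f"{BASE_URL}/executions/{execution_id}",
                             headers=HEADERS, verify=False).json()
    if execution["status"] != "RUNNING":
        break
    time.sleep(60)                      # poll once a minute
print("Execution finished with status:", execution["status"])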
GET /executions/{id}/summary (Returns the job execution by execution id in summarized format)
{
"id": 72,
"job_id": 5,
"status": "SUCCEEDED",
"create_time": "2022-12-18T13:38:43.722917",
"end_time": "2022-12-18T13:43:16.554603",
"total_objects": 4,
"total_rows": 16,
"tasks": [
{
"name": "Unload",
"status": "SUCCEEDED",
"start_time": "2022-12-18T13:38:44.184296",
"end_time": "2022-12-18T13:38:54.972883",
"received_objects": 4,
"succeeded_objects": 4,
"failed_objects": 0,
"processing_objects": 0,
"processed_rows": 16,
"total_rows": 16
},
{
"name": "Masking",
"status": "SUCCEEDED",
"start_time": "2022-12-18T13:38:51.979725",
"end_time": "2022-12-18T13:42:58.569202",
"received_objects": 4,
"succeeded_objects": 4,
"failed_objects": 0,
"processing_objects": 0,
"processed_rows": 16,
"total_rows": 16
},
{
"name": "Load",
"status": "SUCCEEDED",
"start_time": "2022-12-18T13:40:39.350857",
"end_time": "2022-12-18T13:43:12.966492",
"received_objects": 4,
"succeeded_objects": 4,
"failed_objects": 0,
"processing_objects": 0,
"processed_rows": 16,
"total_rows": 16
},
{
"name": "Post Load",
"status": "SUCCEEDED",
"start_time": "2022-12-18T13:43:12.981490",
"end_time": "2022-12-18T13:43:15.764366",
"metadata": [
{
"type": "Constraints",
"total": 20,
"processed": 20,
"status": "SUCCESS",
"start_time": "2022-12-18T13:43:12.981490",
"end_time": "2022-12-18T13:43:15.764366"
},
{
"type": "Indexes",
"total": 10,
"processed": 10,
"status": "SUCCESS",
"start_time": "2022-12-18T13:43:12.981490",
"end_time": "2022-12-18T13:43:15.764366"
},
{
"type": "Triggers",
"total": 5,
"processed": 5,
"status": "SUCCESS",
"start_time": "2022-12-18T13:43:12.981490",
"end_time": "2022-12-18T13:43:15.764366"
}
]
}
]
}
GET /executions/{execution_id} (Returns the job execution by execution_id in the detailed format)
The execution response may initially return an approximate number of rows at the start of execution and provide actual values later during the execution.
Request:
id: 1
Response:
{
"id": 1,
"job_id": 1,
"status": "SUCCEEDED",
"create_time": "2023-04-26T12:34:38.012768",
"end_time": "2023-04-26T12:37:32.410297",
"total_objects": 1,
"total_rows": 499999,
"tasks": [
{
"name": "Unload",
"status": "SUCCEEDED",
"start_time": "2023-04-26T12:34:38.027224",
"end_time": "2023-04-26T12:34:42.435849",
"metadata": [
{
"source_key": "dbo.test_TEMP",
"total_rows": 499999,
"status": "SUCCEEDED",
"unloaded_rows": 499999
}
]
},
{
"name": "Masking",
"status": "SUCCEEDED",
"start_time": "2023-04-26T12:34:40.420073",
"end_time": "2023-04-26T12:35:12.423744",
"metadata": [
{
"source_key": "dbo.test_TEMP",
"total_rows": 499999,
"status": "SUCCEEDED",
"masked_rows": 499999
}
]
},
{
"name": "Load",
"status": "SUCCEEDED",
"start_time": "2023-04-26T12:37:08.482240",
"end_time": "2023-04-26T12:37:22.417561",
"metadata": [
{
"source_key": "dbo.test_TEMP",
"total_rows": 499999,
"status": "SUCCEEDED",
"loaded_rows": 499999
}
]
},
{
"name": "Post Load",
"status": "SUCCEEDED",
"start_time": "2023-04-26T12:37:22.426813",
"end_time": "2023-04-26T12:37:22.814583",
"metadata": [
{
"status": "SUCCEEDED",
"table_set": [
"test_TEMP_Result"
],
"object_details": [
{
"type": "Triggers",
"total": 2,
"processed": 2,
"status": "SUCCEEDED",
"start_time": "2023-04-26T12:35:10.325948",
"end_time": "2023-04-26T12:37:22.804792"
},
{
"type": "Indexes",
"total": 4,
"processed": 4,
"status": "SUCCEEDED",
"start_time": "2023-04-26T12:35:10.325948",
"end_time": "2023-04-26T12:37:22.804792"
},
{
"type": "Constraints",
"total": 5,
"processed": 5,
"status": "SUCCEEDED",
"start_time": "2023-04-26T12:35:10.325948",
"end_time": "2023-04-26T12:37:22.804792"
}
]
}
]
}
]
}
Only in case of an execution failure, the below API can be used to restart the execution:
PUT /executions/{execution_id}/restart (Restart a failed execution).
The below API can be used only for manually cleaning up the execution:
DELETE /executions/{execution_id} (Clean up the execution).
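A sketch of both calls, continuing the same Python example:
# Restart a failed execution (valid only after a failure).
requests.put(f"{BASE_URL}/executions/{execution_id}/restart",
             headers=HEADERS, verify=False).raise_for_status()

# Manually clean up an execution once its data is no longer needed.
requests.delete(f"{BASE_URL}/executions/{execution_id}",
                headers=HEADERS, verify=False).raise_for_status()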