Executing a profiler task
Perform the following steps to execute a profiler task:
Add the authorization key generated for the controller service into the profiler UI. Click the Authorize button and enter the key in the following format:
"apk <authorization-key>"
Validate all the configured connectors on the Hyperscale controller using the /connector-info GET API endpoint. If the connector info contains AWS credentials, the response returns them masked.
Example: /connector-info response with AWS credentials:
[ { "source": { "type": "AWS", "properties": { "server": "S3", "path": "s3_bucket_source/sub_folder", "aws_region": "us-east-1", "aws_access_key_id": "AKIA********", "aws_secret_access_key": "x2IX********", "aws_role_arn": "56436882398" } }, "target": { "type": "AWS", "properties": { "server": "S3", "path": "s3_bucket_target/sub_folder", "aws_region": "us-east-1", "aws_access_key_id": "AKIA********", "aws_secret_access_key": "x2IX********", "aws_role_arn": "56436882398" } } } ]
Because the credentials are masked, the profiler needs them supplied separately (unless IAM role-based authentication is used or the AWS credentials are set through environment variables).
Use the /source-credentials/{connectorId} POST API endpoint to add the credentials mapped to the connector ID received from the controller.
POST /source-credentials/{connectorId} - Request JSON
{ "aws_access_key_id": "AKIAJSJDFJSBSG", "aws_secret_access_key": "x2IXHFKDjskdnmldf&kksdfh%jsdf" }
POST /source-credentials/{connectorId} - Response JSON
{ "connector_id": 1, "aws_access_key_id": "AKIA********", "aws_secret_access_key": "x2IX********" }
Example: /connector-info response with Hadoop database details:
{ "id": 1, "connectorName": "Hadoop_Hive_connector_hive_16", "source": { "type": "HADOOP-DB", "properties": { "server": "hdpserver.dlpxdc.co", "database_type": "hive", "database_port": "10000", "database_name": "default", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } }, "target": { "type": "HADOOP-FS", "properties": { "server": "hdpserver.dlpxdc.co", "path": "/targetfiles", "port": "8020", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } } }
Example: /connector-info response with Hadoop Filesystem details:
{ "id": 1, "connectorName": "Hadoop_Hive_connector_hive_16", "source": { "type": "HADOOP-FS", "properties": { "server": "hdpserver.dlpxdc.co", "path": "/sourcefiles", "port": "8020", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } }, "target": { "type": "HADOOP-FS", "properties": { "server": "hdpserver.dlpxdc.co", "path": "/targetfiles", "port": "8020", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } } }
The profile sets are essentially a list of all masking algorithms mapped to domain names, which the profiler can assign to columns. No default profile set is created when the Parquet Profiler starts for the first time. To create the default profile set, call the /profile-sets API endpoint. There should now be a default profile set with ID 1.
GET /profile-sets - Response JSON
[ { "exclusions": [ "_id", "_id.oid", "$oid", "_id.$oid", "id" ], "set_id": 1, "date_created": "2023-12-14T12:46:34.686136", "name": "DEFAULT", "description": "default profiler set", "entities": [ { "domain_name": "ZIP", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "pattern", "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)", "meta_context": [ "zip", "code" ] }, { "domain_name": "CREDIT CARD", "algorithm_name": "CreditCard", "type": "DL" }, { "domain_name": "DOB", "algorithm_name": "DateShiftDiscrete", "date_format": "yyyy-mm-dd", "type": "DL_DT", "min_age_years": 18, "max_age_years": 100 }, { "domain_name": "EMAIL", "algorithm_name": "dlpx-core:Email SL", "type": "DL" }, { "domain_name": "IP ADDRESS", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "DL" }, { "domain_name": "ADDRESS", "algorithm_name": "AddrLookup", "type": "DL" }, { "domain_name": "CITY", "algorithm_name": "USCitiesLookup", "type": "DL" }, { "domain_name": "COUNTRY", "algorithm_name": "NullValueLookup", "type": "DL" }, { "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName", "type": "DL" }, { "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName", "type": "DL" }, { "domain_name": "FULL_NAME", "algorithm_name": "dlpx-core:FullName", "type": "DL" }, { "domain_name": "TELEPHONE_NO", "algorithm_name": "dlpx-core:Phone US", "type": "DL" }, { "domain_name": "WEB", "algorithm_name": "WebURLsLookup", "type": "DL" }, { "domain_name": "DRIVING_LC", "algorithm_name": "DrivingLicenseNoLookup", "type": "DL" }, { "domain_name": "SSN", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "DL" } ], "date_last_updated": "2023-12-14T12:46:34.686142" } ]
Generally, the default profile set should suffice for most use cases. However, if you want to map different masking algorithms available in your Delphix Compliance Engine to different domains, create your own profile set using the /profile-sets POST API endpoint. To learn more about the profile sets available in your Delphix Compliance Engine, refer to the Compliance Engine documentation.
POST /profile-sets - Request JSON
{ "set_id": 2, "name": "custom_profile_set", "description": "Different Algorithm Mapping", "exclusions": [ "_id", "_id.oid", "$oid", "_id.$oid", "id" ], "entities": [ { "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName", "type": "DL" }, { "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName", "type": "DL" } ] }
Understanding the profile-set payload parameters:
name: Name of the profile set.
exclusions: List of fields (or column names) to exclude from the discovery.
entities: List of entity types to run discovery:
domain_name: The domain name must exist in the Compliance Engine. Note that the domain name of DL-type entities cannot be modified.
algorithm_name: Any available algorithm, whether out-of-the-box or custom, can be assigned to any entity type.
type: The following entity types are allowed:
“DL”: Deep Learning and NLP-based discovery. All DL entities must use their corresponding domain name from the tables listed below. Example payload:
{ "domain_name": "CREDIT_CARD", "algorithm_name": "CreditCard", "type": "DL" }
“context”: Users can provide their own list of explicit values for discovery. Example payload:
{ "domain_name": "TITLE", "algorithm_name": "RandomValueLookup", "type": "context", "list": [ "Mr.", "Mrs.", "Ms.", "Miss", "Madam", "Master" ] }
“pattern”: A regex-based entity; users can add their own regex criteria. Additionally, a list of fields (meta_context) can be supplied to provide further context that supports the regex discovery. Example payload:
{ "domain_name": "ZIP_CODE", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "pattern", "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)", "meta_context": [ "zip", "code" ] }
POST /profile-sets - Response JSON
CODE{ "set_id": 2, "name": "custom_profile_set", "description": "Different Algorithm Mapping", "exclusions": [ "_id", "_id.oid", "$oid", "_id.$oid", "id" ], "entities": [ { "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName", "type": "DL" }, { "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName", "type": "DL" } ] }
You can now start a profiler task using the /tasks POST API endpoint.
Example 1: POST /tasks - Request JSON
{ "connector_id": 1, "set_id": 1, "scan_depth": 1000, "unique_source_files_identifier": "file_identifier", "unload_split": 2, "file_type": "parquet" }
Example 2: POST /tasks - Request JSON
{
"connector_id": 16,
"set_id": 1,
"scan_depth": 4,
"unique_source_files_identifier": "string",
"unload_split": 2,
"file_type": null,
"table_list": [
"people_directory"
]
}
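The two request shapes above can be summarized in code; a sketch based on the examples (for the Hive-table variant, file_type is None, serialized as null):

```python
# Sketch: the two POST /tasks request shapes shown above.
# Example 1 scans parquet files from the connector's S3 path; Example 2
# scans Hive tables, so file_type is None and table_list names the tables.
parquet_task = {
    "connector_id": 1,
    "set_id": 1,
    "scan_depth": 1000,
    "unique_source_files_identifier": "file_identifier",
    "unload_split": 2,
    "file_type": "parquet",
}
hive_task = {
    **parquet_task,
    "connector_id": 16,
    "scan_depth": 4,
    "unique_source_files_identifier": "string",
    "file_type": None,
    "table_list": ["people_directory"],
}
```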
Understanding the task payload parameters:
a. connector_id - The connector to get the source details from. The profiler will identify all files (recursively) within the source S3 path provided in the connector-info details.
b. set_id - The profiler set ID that the profiler tasks should run against.
c. scan_depth - The number of (randomly sampled) rows in each parquet file that the profiler analyzes to determine what kind of sensitive data it contains.
d. unique_source_files_identifier - The source key value that the resultant Hyperscale Parquet Connector dataset should be populated with.
e. unload_split - The unload split that the resultant Hyperscale Parquet Connector dataset should be populated with.
f. file_type - The file type; should be “parquet” for file-based sources (in Example 2, which scans Hive tables, it is null).
g. table_list - List of Hive tables that will be scanned for profiling.
POST /tasks - Response JSON
{ "task_id": "11b92f0f-7c08-4768-97c5-17ce73213dc8", "status": "RUNNING" }
The status of the task can be monitored using the /tasks/{id} GET API endpoint. Once the status shows “SUCCESS”, the Hyperscale Parquet Connector dataset generated by the profiler is included in the results.
Example 1: GET /tasks/{id} - Response JSON
{ "task_id": "11b92f0f-7c08-4768-97c5-17ce73213dc8", "connector_id": 1, "data_set_id": null, "status": "SUCCESS", "set_id": 1, "scan_depth": 100, "file_type": "parquet", "unique_source_files_identifier": "file_identifier", "unload_split": 2, "results": { "connector_id": 1, "data_info": [ { "source": { "unique_source_files_identifier": "file_identifier_1", "file_type": "parquet", "unload_split": 2, "source_files": [ "customer/part-00000.gz.parquet", "customer/part-00001.gz.parquet", "customer/part-00002.gz.parquet", "customer/part-00003.gz.parquet", "customer/part-00004.gz.parquet", "customer/part-00005.gz.parquet", "customer/part-00006.gz.parquet", "customer/part-00007.gz.parquet", "customer/part-00008.gz.parquet", "customer/part-00009.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "c_last", "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName" }, { "field_name": "c_state", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "c_phone", "domain_name": "TELEPHONE_NO", "algorithm_name": "dlpx-core:Phone US" } ] }, { "source": { "unique_source_files_identifier": "file_identifier_2", "file_type": "parquet", "unload_split": 2, "source_files": [ "district/part-00000.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "d_name", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "d_street_2", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "d_state", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" } ] }, { "source": { "unique_source_files_identifier": "file_identifier_7", "file_type": "parquet", "unload_split": 2, "source_files": [ "orders/part-00000.gz.parquet", "orders/part-00001.gz.parquet", "orders/part-00002.gz.parquet", "orders/part-00003.gz.parquet", "orders/part-00004.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "o_id", "domain_name": "TELEPHONE_NO", "algorithm_name": "dlpx-core:Phone US" } ] }, { "source": { "unique_source_files_identifier": "file_identifier_9", "file_type": "parquet", "unload_split": 2, "source_files": [ "warehouse/part-00000.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "w_name", "domain_name": "CITY", "algorithm_name": "USCitiesLookup" }, { "field_name": "w_street_1", "domain_name": "ZIP", "algorithm_name": "dlpx-core:CM Alpha-Numeric" }, { "field_name": "w_state", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "w_zip", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" } ] } ] }, "total": 16, "identified": null, "completion": 100, "elapsed_time": "0:06:47.837970", "start_time": "2023-12-14T13:32:18.913943", "end_time": "2023-12-14T13:39:06.756026", "date_created": "2023-12-14T13:32:18.913948", "date_last_updated": "2023-12-14T13:32:18.913950" }
Example 2: GET /tasks/{id} - Response JSON
{
"task_id": "7b891bdb-0bd9-455b-a27a-eeb5eec0d5b6",
"connector_id": 16,
"data_set_id": null,
"status": "SUCCESS",
"set_id": 1,
"scan_depth": 4,
"file_type": null,
"unique_source_files_identifier": "string",
"unload_split": 2,
"results": {
"connector_id": 16,
"data_info": [
{
"source": {
"unique_source_files_identifier": "string_1",
"file_type": "parquet",
"unload_split": 2,
"source_files": [
"people_directory"
]
},
"target": {
"perform_join": true
},
"masking_inventory": [
{
"field_name": "first_name",
"domain_name": "FIRST_NAME",
"algorithm_name": "dlpx-core:FirstName"
},
{
"field_name": "last_name",
"domain_name": "LAST_NAME",
"algorithm_name": "dlpx-core:LastName"
}
]
}
]
},
"total": 1,
"identified": 1,
"completion": 100,
"elapsed_time": "0:00:18.952383",
"start_time": "2024-08-01T20:52:13.397608",
"end_time": "2024-08-01T20:52:32.375221",
"date_created": "2024-08-01T20:52:13.397619",
"date_last_updated": "2024-08-01T20:52:13.397622"
}
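Task completion can be polled in a loop; a minimal sketch in which `fetch_task` is a hypothetical helper wrapping your HTTP client's GET /tasks/{id} call:

```python
import time

# Sketch: poll GET /tasks/{id} until the task leaves the RUNNING state.
# fetch_task is a hypothetical callable returning the parsed JSON response.
def wait_for_task(fetch_task, task_id, interval_s=10.0, max_polls=60):
    for _ in range(max_polls):
        task = fetch_task(task_id)
        if task["status"] != "RUNNING":
            return task
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} still RUNNING after {max_polls} polls")
```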
You can push the generated dataset directly from the profiler using the /data-sets/{task_id} POST API endpoint. The response contains the ID of the newly created dataset on the controller.
POST /data-sets/{task_id} - Response JSON
{ "data_set_id": 1 }
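Pushing the dataset is a single POST keyed by the task ID from the earlier /tasks response; a sketch of the path construction and of reading the dataset ID from a response shaped like the example above:

```python
import json

# Sketch: build the POST /data-sets/{task_id} path from a completed task ID
# and read the new dataset's ID from the controller's response.
def data_set_path(task_id: str) -> str:
    return f"/data-sets/{task_id}"

path = data_set_path("11b92f0f-7c08-4768-97c5-17ce73213dc8")
response_body = '{ "data_set_id": 1 }'  # example response from above
data_set_id = json.loads(response_body)["data_set_id"]
```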
The DL entities within the default profile set with their algorithms
Type | Domain Name | Algorithm | Description |
DL | FULL_NAME | dlpx-core:FullName | Full name detection |
DL | FIRST_NAME | dlpx-core:FirstName | First name |
DL | LAST_NAME | dlpx-core:LastName | Last name |
DL | EMAIL | dlpx-core:Email SL | Email address |
DL | TELEPHONE_NO | dlpx-core:Phone US | Phone or Mobile number |
DL | DOB | DateShiftDiscrete | Date of Birth |
DL | IP ADDRESS | dlpx-core:CM Alpha-Numeric | IP Address |
DL | CREDIT CARD | CreditCard | Credit Card |
DL | ADDRESS | AddrLookup | Street Address |
DL | CITY | USCitiesLookup | City name |
DL | COUNTRY | NullValueLookup | Country name |
DL | WEB | WebURLsLookup | URL or domain name |
DL | DRIVING_LC | DrivingLicenseNoLookup | US driving license |
DL | SSN | dlpx-core:CM Alpha-Numeric | Social Security Number |
The other available DL entities
Type | Domain Name | Description |
DL | STATE | State name |
DL | STATE_CODE | State Code |
DL | CRYPTO | Bitcoin address |
DL | IBAN_CODE | The International Bank Account Number (IBAN) |
DL | US_BANK_NUMBER | A US bank account number, between 8 and 17 digits |
DL | US_ITIN | US Individual Taxpayer Identification Number (ITIN) |
DL | US_PASSPORT | A US passport number with 9 digits |