Executing a profiler task

Perform the following steps to execute a profiler task:

  1. Add the authorization key generated for the controller service into the profiler UI.

    • Click the Authorize button, then enter the key in the form "apk <authorization-key>".
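
For scripted access, the same key format can be sent as a request header. The following is a minimal Python sketch, assuming the profiler API accepts the key in an Authorization header using the same "apk <authorization-key>" form as the UI. The base URL and key are placeholders, and the request is only built here, not sent.

```python
import urllib.request


def build_request(base_url, path, api_key, method="GET"):
    """Build (but do not send) a request that carries the key as 'apk <key>'."""
    request = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    # Assumption: the API reads the key from the Authorization header.
    request.add_header("Authorization", f"apk {api_key}")
    return request


# Placeholder host and key, for illustration only.
req = build_request("https://profiler.example.com/api", "/connector-info", "1.example-key")
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

Passing the resulting object to urllib.request.urlopen would send it; that call is left out so the sketch runs without a live profiler.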

  2. Validate all the configured connectors on the Hyperscale controller using the /connector-info GET API endpoint.

  3. If a connector uses AWS credentials, the /connector-info response returns those credentials masked.
    Example: /connector-info response with AWS credentials:

    CODE
    [
    {
     "source": {
      "type": "AWS",
      "properties": {
        "server": "S3",
        "path": "s3_bucket_source/sub_folder",
        "aws_region": "us-east-1",
        "aws_access_key_id": "AKIA********",
        "aws_secret_access_key": "x2IX********",
        "aws_role_arn": "56436882398"
      }
     },
     "target": {
      "type": "AWS",
      "properties": {
        "server": "S3",
        "path": "s3_bucket_target/sub_folder",
        "aws_region": "us-east-1",
        "aws_access_key_id": "AKIA********",
        "aws_secret_access_key": "x2IX********",
        "aws_role_arn": "56436882398"
      }
     }
    }
    ]

    Because the credentials are masked, the profiler needs them supplied separately, unless IAM role-based authentication is used or the AWS credentials are set through environment variables.

    Use the /source-credentials/{connectorId} POST API endpoint to add the credentials mapped to the connector ID received from the controller.
    POST /source-credentials/{connectorId} - Request JSON

    CODE
    {
        "aws_access_key_id": "AKIAJSJDFJSBSG",
        "aws_secret_access_key": "x2IXHFKDjskdnmldf&kksdfh%jsdf"
    }

    POST /source-credentials/{connectorId} - Response JSON

    CODE
    {
        "connector_id": 1,
        "aws_access_key_id": "AKIA********",
        "aws_secret_access_key": "x2IX********"
    }
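
The credential step can be scripted along the following lines. This is a sketch under stated assumptions: the trailing-asterisk check for masked keys is inferred from the example response above, and the environment variable names are hypothetical, chosen only to keep secrets out of the script.

```python
import json
import os


def has_masked_aws_credentials(connector):
    """Return True when an AWS source shows a masked access key (ends with '*')."""
    source = connector.get("source", {})
    key_id = source.get("properties", {}).get("aws_access_key_id", "")
    return source.get("type") == "AWS" and key_id.endswith("*")


def build_credentials_request(connector_id, env=os.environ):
    """Return (path, JSON body) for POST /source-credentials/{connectorId}."""
    body = {
        # Hypothetical variable names; use whatever secret store you prefer.
        "aws_access_key_id": env["PROFILER_AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["PROFILER_AWS_SECRET_ACCESS_KEY"],
    }
    return f"/source-credentials/{connector_id}", json.dumps(body)


# Demo with stand-in values instead of the real environment.
aws_connector = {
    "source": {"type": "AWS", "properties": {"aws_access_key_id": "AKIA********"}}
}
demo_env = {
    "PROFILER_AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",
    "PROFILER_AWS_SECRET_ACCESS_KEY": "example-secret",
}
if has_masked_aws_credentials(aws_connector):
    path, body = build_credentials_request(1, env=demo_env)
    print(path)
```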

    Example: /connector-info response with Hadoop database details:

    CODE
    {
        "id": 1,
        "connectorName": "Hadoop_Hive_connector_hive_16",
        "source": {
          "type": "HADOOP-DB",
          "properties": {
            "server": "hdpserver.dlpxdc.co",
            "database_type": "hive",
            "database_port": "10000",
            "database_name": "default",
            "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO",
            "protocol": "hdfs"
          }
        },
        "target": {
          "type": "HADOOP-FS",
          "properties": {
            "server": "hdpserver.dlpxdc.co",
            "path": "/targetfiles",
            "port": "8020",
            "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO",
            "protocol": "hdfs"
          }
        }
      }

    Example: /connector-info response with Hadoop Filesystem details:

    CODE
    {
        "id": 1,
        "connectorName": "Hadoop_Hive_connector_hive_16",
        "source": {
          "type": "HADOOP-FS",
          "properties": {
            "server": "hdpserver.dlpxdc.co",
            "path": "/sourcefiles",
            "port": "8020",
            "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO",
            "protocol": "hdfs"
          }
        },
        "target": {
          "type": "HADOOP-FS",
          "properties": {
            "server": "hdpserver.dlpxdc.co",
            "path": "/targetfiles",
            "port": "8020",
            "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO",
            "protocol": "hdfs"
          }
        }
      }
  4. A profile set is a list of masking algorithms mapped to domain names, which the profiler can assign to columns. No default profile set is created when the Parquet Profiler starts for the first time. To create the default profile set, call the /profile-sets GET API endpoint. A default profile set with ID 1 should now exist.
    GET /profile-sets - Response JSON

    CODE
    [
      {
        "exclusions": [
          "_id",
          "_id.oid",
          "$oid",
          "_id.$oid",
          "id"
        ],
        "set_id": 1,
        "date_created": "2023-12-14T12:46:34.686136",
        "name": "DEFAULT",
        "description": "default profiler set",
        "entities": [
          {
            "domain_name": "ZIP",
            "algorithm_name": "dlpx-core:CM Alpha-Numeric",
            "type": "pattern",
            "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)",
            "meta_context": [
              "zip",
              "code"
            ]
          },
          {
            "domain_name": "CREDIT CARD",
            "algorithm_name": "CreditCard",
            "type": "DL"
          },
          {
            "domain_name": "DOB",
            "algorithm_name": "DateShiftDiscrete",
            "date_format": "yyyy-mm-dd",
            "type": "DL_DT",
            "min_age_years": 18,
            "max_age_years": 100
          },
          {
            "domain_name": "EMAIL",
            "algorithm_name": "dlpx-core:Email SL",
            "type": "DL"
          },
          {
            "domain_name": "IP ADDRESS",
            "algorithm_name": "dlpx-core:CM Alpha-Numeric",
            "type": "DL"
          },
          {
            "domain_name": "ADDRESS",
            "algorithm_name": "AddrLookup",
            "type": "DL"
          },
          {
            "domain_name": "CITY",
            "algorithm_name": "USCitiesLookup",
            "type": "DL"
          },
          {
            "domain_name": "COUNTRY",
            "algorithm_name": "NullValueLookup",
            "type": "DL"
          },
          {
            "domain_name": "FIRST_NAME",
            "algorithm_name": "dlpx-core:FirstName",
            "type": "DL"
          },
          {
            "domain_name": "LAST_NAME",
            "algorithm_name": "dlpx-core:LastName",
            "type": "DL"
          },
          {
            "domain_name": "FULL_NAME",
            "algorithm_name": "dlpx-core:FullName",
            "type": "DL"
          },
          {
            "domain_name": "TELEPHONE_NO",
            "algorithm_name": "dlpx-core:Phone US",
            "type": "DL"
          },
          {
            "domain_name": "WEB",
            "algorithm_name": "WebURLsLookup",
            "type": "DL"
          },
          {
            "domain_name": "DRIVING_LC",
            "algorithm_name": "DrivingLicenseNoLookup",
            "type": "DL"
          },
          {
            "domain_name": "SSN",
            "algorithm_name": "dlpx-core:CM Alpha-Numeric",
            "type": "DL"
          }
        ],
        "date_last_updated": "2023-12-14T12:46:34.686142"
      }
    ]
  5. The default profile set is generally enough for most use cases. If you want to map different masking algorithms available in your Delphix Compliance Engine to different domains, create your own profile set using the /profile-sets POST API endpoint. To know more about the profile sets available in your Delphix Compliance Engine, visit here.
    POST /profile-sets - Request JSON

    CODE
    {
      "set_id": 2,
      "name": "custom_profile_set",
      "description": "Different Algorithm Mapping",
      "exclusions": [
           "_id",
          "_id.oid",
          "$oid",
          "_id.$oid",
          "id"
      ],
      "entities": [
         {
            "domain_name": "FIRST_NAME",
            "algorithm_name": "dlpx-core:FirstName",
            "type": "DL"
          },
          {
            "domain_name": "LAST_NAME",
            "algorithm_name": "dlpx-core:LastName",
            "type": "DL"
          }
      ]
    }

    Understanding the profile-set payload parameters:

    1. name: Name of the profile set.

    2. exclusions: List of fields (or column names) to exclude from the discovery.

    3. entities: List of entity types to run discovery:

      1. domain_name: The domain name must exist in the Compliance Engine. Note that the domain name of a DL-type entity cannot be modified.

      2. algorithm_name: Any available algorithm, whether out of the box or custom, can be assigned to any entity type.

      3. type: The following entity types are allowed:

        1. “DL”: A Deep Learning and NLP based discovery. Every DL entity must use its corresponding domain name from the DL entity tables at the end of this page. Example payload:

          CODE
          {
            "domain_name": "CREDIT_CARD",
            "algorithm_name": "CreditCard",
            "type": "DL"
          }
        2. “context”: A list-based discovery in which users provide an explicit list of values to look for. Example payload:

          CODE
          {
            "domain_name": "TITLE",
            "algorithm_name": "RandomValueLookup",
            "type": "context",
            "list": [
              "Mr.",
              "Mrs.",
              "Ms.",
              "Miss",
              "Madam",
              "Master"
            ]
          }
        3. “pattern”: A regex-based discovery in which users supply their own regular expression. Additionally, a list of field names can be supplied as further context to support regex discovery. Example payload:

          CODE
          {
            "domain_name": "ZIP_CODE",
            "algorithm_name": "dlpx-core:CM Alpha-Numeric",
            "type": "pattern",
            "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)",
            "meta_context": [
              "zip",
              "code"
            ]
          }
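
To see what the example regex accepts, it can be tried locally before it goes into a profile set. The sketch below decodes the JSON-escaped pattern and runs it with Python's re module. It only exercises the regular expression itself; how the profiler combines matches with meta_context is not modeled here.

```python
import json
import re

# The entity is kept as raw JSON text so the doubled backslashes are decoded
# exactly as they would be from the request body.
entity = json.loads(r'''
{
  "domain_name": "ZIP_CODE",
  "type": "pattern",
  "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)"
}
''')
pattern = re.compile(entity["regex"])


def first_match(value):
    """Return the first ZIP-like substring in value, or None if there is none."""
    match = pattern.search(value)
    return match.group(1) if match else None


for sample in ["90210", "90210-1234", "1234567", "Springfield"]:
    print(sample, "->", first_match(sample))
```

Five digits, optionally followed by a hyphen and four more digits, match; a seven-digit number does not, because the word boundaries require the digits to stand alone.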

    POST /profile-sets - Response JSON

    CODE
    {
      "set_id": 2,
      "name": "custom_profile_set",
      "description": "Different Algorithm Mapping",
      "exclusions": [
        "_id",
        "_id.oid",
        "$oid",
        "_id.$oid",
        "id"
      ],
      "entities": [
        {
          "domain_name": "FIRST_NAME",
          "algorithm_name": "dlpx-core:FirstName",
          "type": "DL"
        },
        {
          "domain_name": "LAST_NAME",
          "algorithm_name": "dlpx-core:LastName",
          "type": "DL"
        }
      ]
    }
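
A payload can be sanity-checked before it is posted. The following sketch is an assumption-based pre-flight check, not the profiler's own validation: it treats domain_name, algorithm_name, and type as required for every entity, and infers from the example payloads above that "context" entities carry a list and "pattern" entities carry a regex. Other types (such as DL_DT in the default profile set) are passed through without extra checks.

```python
REQUIRED_KEYS = ("domain_name", "algorithm_name", "type")
# Inferred from the example payloads; not an official schema.
EXTRA_KEYS_BY_TYPE = {"context": ("list",), "pattern": ("regex",)}


def check_profile_set(payload):
    """Return a list of human-readable problems found in a profile-set payload."""
    problems = []
    for index, entity in enumerate(payload.get("entities", [])):
        required = REQUIRED_KEYS + EXTRA_KEYS_BY_TYPE.get(entity.get("type"), ())
        for key in required:
            if key not in entity:
                problems.append(f"entities[{index}] is missing '{key}'")
    return problems


good = {
    "name": "custom_profile_set",
    "entities": [
        {"domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName", "type": "DL"}
    ],
}
bad = {
    "name": "custom_profile_set",
    "entities": [
        {"domain_name": "ZIP_CODE", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "pattern"}
    ],
}
print(check_profile_set(good))
print(check_profile_set(bad))
```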
  6. You can now start a profiler task using the /tasks POST API endpoint.
    Example 1: POST /tasks - Request JSON 

    CODE
    { 
      "connector_id": 1, 
      "set_id": 1, 
      "scan_depth": 1000, 
      "unique_source_files_identifier": "file_identifier", 
      "unload_split": 2, 
      "file_type": "parquet" 
    } 

    Example 2: POST /tasks - Request JSON

    CODE
    {
      "connector_id": 16,
      "set_id": 1,
      "scan_depth": 4,
      "unique_source_files_identifier": "string",
      "unload_split": 2,
      "file_type": null,
      "table_list": [
        "people_directory"
      ]
    }

    Understanding the task payload parameters:

    a. connector_id - The connector from which to get the source details. The profiler identifies all files (recursively) within the source S3 path provided in the connector-info details.

    b. set_id - The ID of the profile set that the profiler task should run against.

    c. scan_depth - The number of randomly selected rows in the Parquet file that the profiler analyzes to determine what kind of sensitive data it contains.

    d. unique_source_files_identifier - The source key value that the resultant Hyperscale Parquet Connector dataset should be populated with.

    e. unload_split - The unload split that the resultant Hyperscale Parquet Connector dataset should be populated with.

    f. file_type - The file type should be “parquet”.

    g. table_list - The list of Hive tables to scan for profiling.
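
The parameters above can be assembled with a small helper. This is an illustrative sketch; the function name and its defaults are hypothetical, and it reproduces the two request examples rather than any official client.

```python
import json


def build_task_payload(connector_id, set_id, scan_depth, identifier,
                       unload_split, file_type="parquet", table_list=None):
    """Assemble a POST /tasks body from the parameters described above."""
    payload = {
        "connector_id": connector_id,
        "set_id": set_id,
        "scan_depth": scan_depth,
        "unique_source_files_identifier": identifier,
        "unload_split": unload_split,
        "file_type": file_type,
    }
    # table_list is added only when Hive tables are named, as in Example 2.
    if table_list is not None:
        payload["table_list"] = list(table_list)
    return payload


# Mirrors Example 1 (file-based source) and Example 2 (Hive table list).
files_task = build_task_payload(1, 1, 1000, "file_identifier", 2)
hive_task = build_task_payload(16, 1, 4, "string", 2,
                               file_type=None, table_list=["people_directory"])
print(json.dumps(hive_task, indent=2))
```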

    POST /tasks - Response JSON

    CODE
    {
      "task_id": "11b92f0f-7c08-4768-97c5-17ce73213dc8",
      "status": "RUNNING"
    }
  7. The status of the task can be monitored using the /tasks/{id} GET API endpoint.

  8. Once the status shows “SUCCESS”, the Hyperscale Parquet Connector dataset generated by the profiler is included in the results.
    Example 1: GET /tasks/{id} - Response JSON

    CODE
    {
      "task_id": "11b92f0f-7c08-4768-97c5-17ce73213dc8",
      "connector_id": 1,
      "data_set_id": null,
      "status": "SUCCESS",
      "set_id": 1,
      "scan_depth": 100,
      "file_type": "parquet",
      "unique_source_files_identifier": "file_identifier",
      "unload_split": 2,
      "results": {
        "connector_id": 1,
        "data_info": [
          {
            "source": {
              "unique_source_files_identifier": "file_identifier_1",
              "file_type": "parquet",
              "unload_split": 2,
              "source_files": [
                "customer/part-00000.gz.parquet",
                "customer/part-00001.gz.parquet",
                "customer/part-00002.gz.parquet",
                "customer/part-00003.gz.parquet",
                "customer/part-00004.gz.parquet",
                "customer/part-00005.gz.parquet",
                "customer/part-00006.gz.parquet",
                "customer/part-00007.gz.parquet",
                "customer/part-00008.gz.parquet",
                "customer/part-00009.gz.parquet"
              ]
            },
            "target": {
              "perform_join": true
            },
            "masking_inventory": [
              {
                "field_name": "c_last",
                "domain_name": "FIRST_NAME",
                "algorithm_name": "dlpx-core:FirstName"
              },
              {
                "field_name": "c_state",
                "domain_name": "LAST_NAME",
                "algorithm_name": "dlpx-core:LastName"
              },
              {
                "field_name": "c_phone",
                "domain_name": "TELEPHONE_NO",
                "algorithm_name": "dlpx-core:Phone US"
              }
            ]
          },
          {
            "source": {
              "unique_source_files_identifier": "file_identifier_2",
              "file_type": "parquet",
              "unload_split": 2,
              "source_files": [
                "district/part-00000.gz.parquet"
              ]
            },
            "target": {
              "perform_join": true
            },
            "masking_inventory": [
              {
                "field_name": "d_name",
                "domain_name": "LAST_NAME",
                "algorithm_name": "dlpx-core:LastName"
              },
              {
                "field_name": "d_street_2",
                "domain_name": "LAST_NAME",
                "algorithm_name": "dlpx-core:LastName"
              },
              {
                "field_name": "d_state",
                "domain_name": "LAST_NAME",
                "algorithm_name": "dlpx-core:LastName"
              }
            ]
          },
          {
            "source": {
              "unique_source_files_identifier": "file_identifier_7",
              "file_type": "parquet",
              "unload_split": 2,
              "source_files": [
                "orders/part-00000.gz.parquet",
                "orders/part-00001.gz.parquet",
                "orders/part-00002.gz.parquet",
                "orders/part-00003.gz.parquet",
                "orders/part-00004.gz.parquet"
              ]
            },
            "target": {
              "perform_join": true
            },
            "masking_inventory": [
              {
                "field_name": "o_id",
                "domain_name": "TELEPHONE_NO",
                "algorithm_name": "dlpx-core:Phone US"
              }
            ]
          },
          {
            "source": {
              "unique_source_files_identifier": "file_identifier_9",
              "file_type": "parquet",
              "unload_split": 2,
              "source_files": [
                "warehouse/part-00000.gz.parquet"
              ]
            },
            "target": {
              "perform_join": true
            },
            "masking_inventory": [
              {
                "field_name": "w_name",
                "domain_name": "CITY",
                "algorithm_name": "USCitiesLookup"
              },
              {
                "field_name": "w_street_1",
                "domain_name": "ZIP",
                "algorithm_name": "dlpx-core:CM Alpha-Numeric"
              },
              {
                "field_name": "w_state",
                "domain_name": "LAST_NAME",
                "algorithm_name": "dlpx-core:LastName"
              },
              {
                "field_name": "w_zip",
                "domain_name": "LAST_NAME",
                "algorithm_name": "dlpx-core:LastName"
              }
            ]
          }
        ]
      },
      "total": 16,
      "identified": null,
      "completion": 100,
      "elapsed_time": "0:06:47.837970",
      "start_time": "2023-12-14T13:32:18.913943",
      "end_time": "2023-12-14T13:39:06.756026",
      "date_created": "2023-12-14T13:32:18.913948",
      "date_last_updated": "2023-12-14T13:32:18.913950"
    }
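
The nested results can be flattened into one row per discovered field for review. The sketch below is a minimal example that relies only on the keys visible in the response above; the sample input is a trimmed copy of that response.

```python
def flatten_inventory(task):
    """Return (identifier, field, domain, algorithm) rows from a task response."""
    rows = []
    for info in task["results"]["data_info"]:
        identifier = info["source"]["unique_source_files_identifier"]
        for item in info["masking_inventory"]:
            rows.append((identifier, item["field_name"],
                         item["domain_name"], item["algorithm_name"]))
    return rows


# Trimmed copy of the response above.
sample_task = {
    "status": "SUCCESS",
    "results": {
        "data_info": [
            {
                "source": {"unique_source_files_identifier": "file_identifier_7"},
                "masking_inventory": [
                    {"field_name": "o_id", "domain_name": "TELEPHONE_NO",
                     "algorithm_name": "dlpx-core:Phone US"}
                ],
            }
        ]
    },
}
for row in flatten_inventory(sample_task):
    print(row)
```

A flat listing like this makes it easier to review each field-to-domain assignment before the dataset is pushed to the controller.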

    Example 2: GET /tasks/{id} - Response JSON

    CODE
    {
      "task_id": "7b891bdb-0bd9-455b-a27a-eeb5eec0d5b6",
      "connector_id": 16,
      "data_set_id": null,
      "status": "SUCCESS",
      "set_id": 1,
      "scan_depth": 4,
      "file_type": null,
      "unique_source_files_identifier": "string",
      "unload_split": 2,
      "results": {
        "connector_id": 16,
        "data_info": [
          {
            "source": {
              "unique_source_files_identifier": "string_1",
              "file_type": "parquet",
              "unload_split": 2,
              "source_files": [
                "people_directory"
              ]
            },
            "target": {
              "perform_join": true
            },
            "masking_inventory": [
              {
                "field_name": "first_name",
                "domain_name": "FIRST_NAME",
                "algorithm_name": "dlpx-core:FirstName"
              },
              {
                "field_name": "last_name",
                "domain_name": "LAST_NAME",
                "algorithm_name": "dlpx-core:LastName"
              }
            ]
          }
        ]
      },
      "total": 1,
      "identified": 1,
      "completion": 100,
      "elapsed_time": "0:00:18.952383",
      "start_time": "2024-08-01T20:52:13.397608",
      "end_time": "2024-08-01T20:52:32.375221",
      "date_created": "2024-08-01T20:52:13.397619",
      "date_last_updated": "2024-08-01T20:52:13.397622"
    }
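
Monitoring a task usually means polling /tasks/{id} until it leaves the RUNNING state. The sketch below shows one way to do that. It assumes any status other than "RUNNING" is final (only "RUNNING" and "SUCCESS" appear on this page), and it takes the HTTP call as a caller-supplied function so that the loop can be shown without a live profiler. The helper names are hypothetical.

```python
import time


def wait_for_task(fetch_task, task_id, interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll fetch_task(task_id) until the status is no longer RUNNING."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_task(task_id)
        # Assumption: anything other than RUNNING is a final state.
        if task["status"] != "RUNNING":
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still RUNNING after {timeout} seconds")
        sleep(interval)


# Stand-in for a real GET /tasks/{id} call: RUNNING twice, then SUCCESS.
_statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
calls = []


def fake_fetch(task_id):
    calls.append(task_id)
    return {"task_id": task_id, "status": next(_statuses)}


final = wait_for_task(fake_fetch, "11b92f0f-7c08-4768-97c5-17ce73213dc8",
                      sleep=lambda seconds: None)
print(final["status"], "after", len(calls), "polls")
```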
  9. You can push the generated dataset directly from the profiler using the /data-sets/{task_id} POST API endpoint. The response contains the ID of the newly created dataset on the controller.
    POST /data-sets/{task_id} - Response JSON

    CODE
    {
      "data_set_id": 1
    }

The DL entities within the default Profiler-Set with their algorithms

Type  Domain Name   Algorithm                   Description
----  ------------  --------------------------  ----------------------
DL    FULL_NAME     dlpx-core:FullName          Full name detection
DL    FIRST_NAME    dlpx-core:FirstName         First name
DL    LAST_NAME     dlpx-core:LastName          Last name
DL    EMAIL         dlpx-core:Email SL          Email address
DL    TELEPHONE_NO  dlpx-core:Phone US          Phone or Mobile number
DL    DOB           DateShiftDiscrete           Date of Birth
DL    IP ADDRESS    dlpx-core:CM Alpha-Numeric  IP Address
DL    CREDIT CARD   CreditCard                  Credit Card
DL    ADDRESS       AddrLookup                  Street Address
DL    CITY          USCitiesLookup              City name
DL    COUNTRY       NullValueLookup             Country name
DL    WEB           WebURLsLookup               URL or domain name
DL    DRIVING_LC    DrivingLicenseNoLookup      US driving license
DL    SSN           dlpx-core:CM Alpha-Numeric  Social Security Number

The other available DL entities

Type  Domain Name     Description
----  --------------  ----------------------------------------------------
DL    STATE           State name
DL    STATE_CODE      State code
DL    CRYPTO          Bitcoin address
DL    IBAN_CODE       The International Bank Account Number (IBAN)
DL    US_BANK_NUMBER  A US bank account number, between 8 and 17 digits
DL    US_ITIN         US Individual Taxpayer Identification Number (ITIN)
DL    US_PASSPORT     A US passport number with 9 digits
