Executing a profiler task
Perform the following steps to execute a profiler task:
Add the authorization key generated for the controller service into the profiler UI. Click the Authorize button and enter the key in the following format:
"apk <authorization-key>"
Validate all the configured connectors on the Hyperscale controller using the /connector-info GET API endpoint. If the connector info contains AWS credentials, the response returns them masked.
Example: /connector-info response with AWS credentials:
[ { "source": { "type": "AWS", "properties": { "server": "S3", "path": "s3_bucket_source/sub_folder", "aws_region": "us-east-1", "aws_access_key_id": "AKIA********", "aws_secret_access_key": "x2IX********", "aws_role_arn": "56436882398" } }, "target": { "type": "AWS", "properties": { "server": "S3", "path": "s3_bucket_target/sub_folder", "aws_region": "us-east-1", "aws_access_key_id": "AKIA********", "aws_secret_access_key": "x2IX********", "aws_role_arn": "56436882398" } } } ]
Because the credentials are masked, the profiler needs them supplied separately (unless IAM role-based authentication is used or the AWS credentials are set through environment variables).
Use the /source-credentials/{connectorId} POST API endpoint to add the credentials mapped to the connector ID received from the controller.
POST /source-credentials/{connectorId} - Request JSON
{ "aws_access_key_id": "AKIAJSJDFJSBSG", "aws_secret_access_key": "x2IXHFKDjskdnmldf&kksdfh%jsdf" }
POST /source-credentials/{connectorId} - Response JSON
{ "connector_id": 1, "aws_access_key_id": "AKIA********", "aws_secret_access_key": "x2IX********" }
Example: /connector-info response with Hadoop database details:
{ "id": 1, "connectorName": "Hadoop_Hive_connector_hive_16", "source": { "type": "HADOOP-DB", "properties": { "server": "hdpserver.dlpxdc.co", "database_type": "hive", "database_port": "10000", "database_name": "default", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } }, "target": { "type": "HADOOP-FS", "properties": { "server": "hdpserver.dlpxdc.co", "path": "/targetfiles", "port": "8020", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } } }
Example: /connector-info response with Hadoop Filesystem details:
{ "id": 1, "connectorName": "Hadoop_Hive_connector_hive_16", "source": { "type": "HADOOP-FS", "properties": { "server": "hdpserver.dlpxdc.co", "path": "/sourcefiles", "port": "8020", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } }, "target": { "type": "HADOOP-FS", "properties": { "server": "hdpserver.dlpxdc.co", "path": "/targetfiles", "port": "8020", "principal_name": "hive/hdpserver.dlpxdc.co@DLPXDC.CO", "protocol": "hdfs" } } }
The profile sets are essentially a list of all masking algorithms mapped to domain names, which the profiler can assign to columns. No default profile set is created when the Parquet Profiler starts for the first time. To create the default profile set, call the /profile-sets API endpoint. There should now be a default profile set with ID 1.
GET /profile-sets - Response JSON
[ { "exclusions": [ "_id", "_id.oid", "$oid", "_id.$oid", "id" ], "set_id": 1, "date_created": "2023-12-14T12:46:34.686136", "name": "DEFAULT", "description": "default profiler set", "entities": [ { "domain_name": "ZIP", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "pattern", "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)", "meta_context": [ "zip", "code" ] }, { "domain_name": "CREDIT CARD", "algorithm_name": "CreditCard", "type": "DL" }, { "domain_name": "DOB", "algorithm_name": "DateShiftDiscrete", "date_format": "yyyy-mm-dd", "type": "DL_DT", "min_age_years": 18, "max_age_years": 100 }, { "domain_name": "EMAIL", "algorithm_name": "dlpx-core:Email SL", "type": "DL" }, { "domain_name": "IP ADDRESS", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "DL" }, { "domain_name": "ADDRESS", "algorithm_name": "AddrLookup", "type": "DL" }, { "domain_name": "CITY", "algorithm_name": "USCitiesLookup", "type": "DL" }, { "domain_name": "COUNTRY", "algorithm_name": "NullValueLookup", "type": "DL" }, { "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName", "type": "DL" }, { "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName", "type": "DL" }, { "domain_name": "FULL_NAME", "algorithm_name": "dlpx-core:FullName", "type": "DL" }, { "domain_name": "TELEPHONE_NO", "algorithm_name": "dlpx-core:Phone US", "type": "DL" }, { "domain_name": "WEB", "algorithm_name": "WebURLsLookup", "type": "DL" }, { "domain_name": "DRIVING_LC", "algorithm_name": "DrivingLicenseNoLookup", "type": "DL" }, { "domain_name": "SSN", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "DL" } ], "date_last_updated": "2023-12-14T12:46:34.686142" } ]
Generally, the default profile set should suffice for most use cases. However, if you want to map different masking algorithms available in your Delphix Compliance Engine to different domains, create your own profile set using the /profile-sets POST API endpoint. To learn more about the profile sets available in your Delphix Compliance Engine, refer to the Compliance Engine documentation.
POST /profile-sets - Request JSON
{ "set_id": 2, "name": "custom_profile_set", "description": "Different Algorithm Mapping", "exclusions": [ "_id", "_id.oid", "$oid", "_id.$oid", "id" ], "entities": [ { "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName", "type": "DL" }, { "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName", "type": "DL" } ] }
Understanding the profile-set payload parameters:
name: Name of the profile set.
exclusions: List of fields (or column names) to exclude from the discovery.
entities: List of entity types to run discovery:
domain_name: The domain name must exist in the Compliance Engine. Note that the domain name of DL-type entities cannot be modified.
algorithm_name: Any available algorithm, whether out-of-the-box or custom, can be assigned to any entity type.
type: The following entity types are allowed:
“DL”: Deep Learning and NLP-based discovery. All DL entities must use their corresponding domain name from the tables listed below. Example payload:
{ "domain_name": "CREDIT_CARD", "algorithm_name": "CreditCard", "type": "DL" }
“context”: Users can provide their own list of explicit values for discovery. Example payload:
{ "domain_name": "TITLE", "algorithm_name": "RandomValueLookup", "type": "context", "list": [ "Mr.", "Mrs.", "Ms.", "Miss", "Madam", "Master" ] }
“pattern”: A regex-based entity; users can add their own regex criteria. Additionally, a list of fields (meta_context) can be supplied to provide further context that supports the regex discovery. Example payload:
{ "domain_name": "ZIP_CODE", "algorithm_name": "dlpx-core:CM Alpha-Numeric", "type": "pattern", "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)", "meta_context": [ "zip", "code" ] }
POST /profile-sets - Response JSON
CODE{ "set_id": 2, "name": "custom_profile_set", "description": "Different Algorithm Mapping", "exclusions": [ "_id", "_id.oid", "$oid", "_id.$oid", "id" ], "entities": [ { "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName", "type": "DL" }, { "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName", "type": "DL" } ] }
You can now start a profiler task using the /tasks POST API endpoint.
Example 1: POST /tasks - Request JSON
{ "connector_id": 1, "set_id": 1, "scan_depth": 1000, "unique_source_files_identifier": "file_identifier", "unload_split": 2, "file_type": "parquet" }
Example 2: POST /tasks - Request JSON
{
"connector_id": 16,
"set_id": 1,
"scan_depth": 4,
"unique_source_files_identifier": "string",
"unload_split": 2,
"file_type": null,
"table_list": [
"people_directory"
]
}
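The two request shapes above can be summarized in code; a sketch based on the examples (for the Hive-table variant, file_type is None, serialized as null):

```python
# Sketch: the two POST /tasks request shapes shown above.
# Example 1 scans parquet files from the connector's S3 path; Example 2
# scans Hive tables, so file_type is None and table_list names the tables.
parquet_task = {
    "connector_id": 1,
    "set_id": 1,
    "scan_depth": 1000,
    "unique_source_files_identifier": "file_identifier",
    "unload_split": 2,
    "file_type": "parquet",
}
hive_task = {
    **parquet_task,
    "connector_id": 16,
    "scan_depth": 4,
    "unique_source_files_identifier": "string",
    "file_type": None,
    "table_list": ["people_directory"],
}
```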
Understanding the task payload parameters:
a. connector_id - The connector to get the source details from. The profiler will identify all files (recursively) within the source S3 path provided in the connector-info details.
b. set_id - The profiler set ID that the profiler tasks should run against.
c. scan_depth - The number of (randomly sampled) rows in each parquet file that the profiler analyzes to determine what kind of sensitive data it contains.
d. unique_source_files_identifier - The source key value that the resultant Hyperscale Parquet Connector dataset should be populated with.
e. unload_split - The unload split that the resultant Hyperscale Parquet Connector dataset should be populated with.
f. file_type - The file type; should be “parquet” for file-based sources (in Example 2, which scans Hive tables, it is null).
g. table_list - List of Hive tables that will be scanned for profiling.
POST /tasks - Response JSON
{ "task_id": "11b92f0f-7c08-4768-97c5-17ce73213dc8", "status": "RUNNING" }
The status of the task can be monitored using the /tasks/{id} GET API endpoint. Once the status shows “SUCCESS”, the Hyperscale Parquet Connector dataset generated by the profiler is included in the results.
Example 1: GET /tasks/{id} - Response JSON
{ "task_id": "11b92f0f-7c08-4768-97c5-17ce73213dc8", "connector_id": 1, "data_set_id": null, "status": "SUCCESS", "set_id": 1, "scan_depth": 100, "file_type": "parquet", "unique_source_files_identifier": "file_identifier", "unload_split": 2, "results": { "connector_id": 1, "data_info": [ { "source": { "unique_source_files_identifier": "file_identifier_1", "file_type": "parquet", "unload_split": 2, "source_files": [ "customer/part-00000.gz.parquet", "customer/part-00001.gz.parquet", "customer/part-00002.gz.parquet", "customer/part-00003.gz.parquet", "customer/part-00004.gz.parquet", "customer/part-00005.gz.parquet", "customer/part-00006.gz.parquet", "customer/part-00007.gz.parquet", "customer/part-00008.gz.parquet", "customer/part-00009.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "c_last", "domain_name": "FIRST_NAME", "algorithm_name": "dlpx-core:FirstName" }, { "field_name": "c_state", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "c_phone", "domain_name": "TELEPHONE_NO", "algorithm_name": "dlpx-core:Phone US" } ] }, { "source": { "unique_source_files_identifier": "file_identifier_2", "file_type": "parquet", "unload_split": 2, "source_files": [ "district/part-00000.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "d_name", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "d_street_2", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "d_state", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" } ] }, { "source": { "unique_source_files_identifier": "file_identifier_7", "file_type": "parquet", "unload_split": 2, "source_files": [ "orders/part-00000.gz.parquet", "orders/part-00001.gz.parquet", "orders/part-00002.gz.parquet", "orders/part-00003.gz.parquet", "orders/part-00004.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "o_id", "domain_name": "TELEPHONE_NO", "algorithm_name": "dlpx-core:Phone US" } ] }, { "source": { "unique_source_files_identifier": "file_identifier_9", "file_type": "parquet", "unload_split": 2, "source_files": [ "warehouse/part-00000.gz.parquet" ] }, "target": { "perform_join": true }, "masking_inventory": [ { "field_name": "w_name", "domain_name": "CITY", "algorithm_name": "USCitiesLookup" }, { "field_name": "w_street_1", "domain_name": "ZIP", "algorithm_name": "dlpx-core:CM Alpha-Numeric" }, { "field_name": "w_state", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" }, { "field_name": "w_zip", "domain_name": "LAST_NAME", "algorithm_name": "dlpx-core:LastName" } ] } ] }, "total": 16, "identified": null, "completion": 100, "elapsed_time": "0:06:47.837970", "start_time": "2023-12-14T13:32:18.913943", "end_time": "2023-12-14T13:39:06.756026", "date_created": "2023-12-14T13:32:18.913948", "date_last_updated": "2023-12-14T13:32:18.913950" }
Example 2: GET /tasks/{id} - Response JSON
{
"task_id": "7b891bdb-0bd9-455b-a27a-eeb5eec0d5b6",
"connector_id": 16,
"data_set_id": null,
"status": "SUCCESS",
"set_id": 1,
"scan_depth": 4,
"file_type": null,
"unique_source_files_identifier": "string",
"unload_split": 2,
"results": {
"connector_id": 16,
"data_info": [
{
"source": {
"unique_source_files_identifier": "string_1",
"file_type": "parquet",
"unload_split": 2,
"source_files": [
"people_directory"
]
},
"target": {
"perform_join": true
},
"masking_inventory": [
{
"field_name": "first_name",
"domain_name": "FIRST_NAME",
"algorithm_name": "dlpx-core:FirstName"
},
{
"field_name": "last_name",
"domain_name": "LAST_NAME",
"algorithm_name": "dlpx-core:LastName"
}
]
}
]
},
"total": 1,
"identified": 1,
"completion": 100,
"elapsed_time": "0:00:18.952383",
"start_time": "2024-08-01T20:52:13.397608",
"end_time": "2024-08-01T20:52:32.375221",
"date_created": "2024-08-01T20:52:13.397619",
"date_last_updated": "2024-08-01T20:52:13.397622"
}
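Task completion can be polled in a loop; a minimal sketch in which `fetch_task` is a hypothetical helper wrapping your HTTP client's GET /tasks/{id} call:

```python
import time

# Sketch: poll GET /tasks/{id} until the task leaves the RUNNING state.
# fetch_task is a hypothetical callable returning the parsed JSON response.
def wait_for_task(fetch_task, task_id, interval_s=10.0, max_polls=60):
    for _ in range(max_polls):
        task = fetch_task(task_id)
        if task["status"] != "RUNNING":
            return task
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} still RUNNING after {max_polls} polls")
```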
You can push the generated dataset directly from the profiler using the /data-sets/{task_id} POST API endpoint. The response contains the ID of the newly created dataset on the controller.
POST /data-sets/{task_id} - Response JSON
{ "data_set_id": 1 }
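Pushing the dataset is a single POST keyed by the task ID from the earlier /tasks response; a sketch of the path construction and of reading the dataset ID from a response shaped like the example above:

```python
import json

# Sketch: build the POST /data-sets/{task_id} path from a completed task ID
# and read the new dataset's ID from the controller's response.
def data_set_path(task_id: str) -> str:
    return f"/data-sets/{task_id}"

path = data_set_path("11b92f0f-7c08-4768-97c5-17ce73213dc8")
response_body = '{ "data_set_id": 1 }'  # example response from above
data_set_id = json.loads(response_body)["data_set_id"]
```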
The DL entities within the default profile set with their algorithms
Type | Domain Name | Algorithm | Description |
DL | FULL_NAME | dlpx-core:FullName | Full name detection |
DL | FIRST_NAME | dlpx-core:FirstName | First name |
DL | LAST_NAME | dlpx-core:LastName | Last name |
DL | EMAIL | dlpx-core:Email SL | Email address |
DL | TELEPHONE_NO | dlpx-core:Phone US | Phone or Mobile number |
DL | DOB | DateShiftDiscrete | Date of Birth |
DL | IP ADDRESS | dlpx-core:CM Alpha-Numeric | IP Address |
DL | CREDIT CARD | CreditCard | Credit Card |
DL | ADDRESS | AddrLookup | Street Address |
DL | CITY | USCitiesLookup | City name |
DL | COUNTRY | NullValueLookup | Country name |
DL | WEB | WebURLsLookup | URL or domain name |
DL | DRIVING_LC | DrivingLicenseNoLookup | US driving license |
DL | SSN | dlpx-core:CM Alpha-Numeric | Social Security Number |
The other available DL entities
Type | Domain Name | Description |
DL | STATE | State name |
DL | STATE_CODE | State Code |
DL | CRYPTO | Bitcoin address |
DL | IBAN_CODE | The International Bank Account Number (IBAN) |
DL | US_BANK_NUMBER | A US bank account number, between 8 and 17 digits |
DL | US_ITIN | US Individual Taxpayer Identification Number (ITIN) |
DL | US_PASSPORT | A US passport number with 9 digits |