Installation and setup (Parquet profiler)
Pre-requisites
Download the latest version of the Parquet Profiler from the Delphix download page.
Ensure that the host running the profiler has docker-compose installed (the profiler has been tested only in the docker-compose environment).
The profiler supports AWS S3 buckets as source locations. Ensure that the profiler has access to the source location (similar to how access was set up for the Hyperscale Parquet Connector). You can use either of the following authentication mechanisms:
Attaching an AWS IAM role that has access to the source S3 buckets to the EC2 host running the profiler.
IAM roles are designed for applications to securely make AWS API requests from EC2 instances, without the need to manage the security credentials that the applications use.
Using the AWS console UI or the AWS CLI, attach the IAM role to the EC2 instance running the Hyperscale services, as in the sketch below. To know more, check the AWS Documentation.
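For illustration, the attachment can be done with the AWS CLI; this is a minimal sketch, assuming the IAM role is already wrapped in an instance profile, with <instance-id> and <profile-name> as placeholders for your environment:

```
# Attach the instance profile (which wraps the IAM role) to the EC2 instance.
aws ec2 associate-iam-instance-profile \
    --instance-id <instance-id> \
    --iam-instance-profile Name=<profile-name>

# Confirm the association took effect.
aws ec2 describe-iam-instance-profile-associations \
    --filters Name=instance-id,Values=<instance-id>
```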
Generating an AWS Access Key ID & AWS Secret Access Key pair for an AWS role that has the privileges to access the source S3 bucket.
Access keys are long-term credentials generated for an IAM user or role. These keys can be used for programmatic requests to the AWS CLI or AWS API (directly or using the AWS SDK). For more information, refer to the AWS Documentation.
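As a sketch of the key-generation step, assuming the keys belong to an IAM user (<profiler-user> is a hypothetical name; role credentials would instead be obtained via STS):

```
# Create a long-term access key pair for the IAM user.
# The response contains AccessKeyId and SecretAccessKey; store them securely.
aws iam create-access-key --user-name <profiler-user>
```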
The profiler also supports Hadoop as a source location. There are two ways to run the profiler with Hadoop as a source:
HADOOP-FS: With this approach, the profiler accesses the HDFS path (given in connector_info) with Kerberos authentication and scans all Parquet files in the given HDFS path.
HADOOP-DB: With this approach, the profiler accesses the Hive database with Kerberos authentication. You have to provide a list of tables while creating the profiler task. The profiler scans the given tables using the Hadoop Hive database details given in connector_info.
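Both modes authenticate with Kerberos, so it is worth verifying the keytab and HDFS access up front. A minimal check, assuming the Hadoop client tools and a valid krb5.conf are available on the host (the principal, realm, and path are placeholders):

```
# Obtain a Kerberos ticket using the keytab.
kinit -kt /path/to/hadoop.keytab <principal>@<REALM>

# Confirm the ticket was granted.
klist

# Verify that the HDFS path from connector_info is readable.
hdfs dfs -ls /path/to/parquet/files
```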
Set up the Hyperscale File Connector and add the required ConnectorInfo details.
Procedure
Untar the profiler downloaded from Delphix’s download page. It should contain the docker images for the profiler and the docker-compose.yaml file to run the profiler.

```
tar -xf parquet-profiler.tar.gz
```
Load the delphix-hyperscale-profiler-api and delphix-hyperscale-profiler-backend docker images.

```
docker load --input delphix-hyperscale-parquet-profiler-api.tar
docker load --input delphix-hyperscale-parquet-profiler-backend.tar
```
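Before moving on, you can confirm the images are present:

```
# List the loaded profiler images.
docker images | grep profiler
```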
Edit the docker-compose YAML file to map the controller endpoint for the delphix-hyperscale-profiler-api to interact with.

```
services:
  ...
  profiler-api-service:
    ...
    environment:
      ...
      - CONTROLLER_URL=https://<controller-ip>/api
```
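It can help to confirm the controller endpoint is reachable from this host before starting the services; a quick check (the -k flag skips TLS verification and is only appropriate if the controller presents a self-signed certificate):

```
# Any HTTP response (rather than a connection error) indicates reachability.
curl -k https://<controller-ip>/api
```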
(Optional) You can also provide the AWS access keys as environment variables; they will be used as the default credentials to access the source S3 location.

```
services:
  ...
  profiler-api-service:
    ...
    environment:
      ...
      - AWS_ACCESS_KEY_ID=<access_key_id>
      - AWS_SECRET_ACCESS_KEY=<secret_access_key>
      - AWS_DEFAULT_REGION=<region>
```
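To sanity-check the key pair before wiring it into the compose file, you can ask AWS which identity it resolves to (run from any machine with the AWS CLI, using the same three variables):

```
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_access_key>
export AWS_DEFAULT_REGION=<region>

# Returns the account and ARN the keys belong to.
aws sts get-caller-identity
```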
To profile data that is present on Hadoop, you have to provide the core-site.xml, krb5.conf, and hadoop.keytab files as volume mounts.

```
services:
  ...
  profiler-api-service:
    ...
    volumes:
      ...
      - /path/to/keytab_file/hadoop.keytab:/app/hadoop.keytab
      - /path/to/hadoop/core-site.xml:/app/hadoop/etc/hadoop/core-site.xml
      - /path/to/etc/krb5.conf:/etc/krb5.conf
```
Start the profiler service.

```
docker-compose up -d
```
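Once started, you can check that the containers are running and tail the API logs if anything looks off (the service name is taken from the compose file above):

```
# Show the state of the profiler services.
docker-compose ps

# Follow the API service logs.
docker-compose logs -f profiler-api-service
```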
Access the profiler Swagger UI at http://<host-ip>:8888.
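If the UI does not load, a quick reachability check from a terminal can narrow things down (an HTTP status code, rather than a connection error, means the service is listening):

```
curl -s -o /dev/null -w "%{http_code}\n" http://<host-ip>:8888
```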