Installation and setup (Parquet profiler)
Pre-requisites
Download the latest version of the Parquet Profiler from the Delphix download page.
Ensure that the host running the profiler has docker-compose installed (the profiler has been tested only in the docker-compose environment).
The profiler supports AWS S3 buckets as source locations. Ensure that the profiler has access to the source location (similar to how access was set up for the Hyperscale Parquet Connector). You can use either of the following authentication mechanisms:
Attaching an AWS IAM role that has access to the source S3 buckets to the EC2 host running the profiler.
IAM roles are designed for applications to securely make AWS API requests from EC2 instances, without the need to manage the security credentials that the applications use.
Using the AWS console UI or the AWS CLI, attach the IAM role to the EC2 instance running the Hyperscale services, as in the sketch below. To know more, check the AWS Documentation.
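For illustration, the attachment can be done with the AWS CLI; this is a minimal sketch, assuming the IAM role is already wrapped in an instance profile, with <instance-id> and <profile-name> as placeholders for your environment:

```
# Attach the instance profile (which wraps the IAM role) to the EC2 instance.
aws ec2 associate-iam-instance-profile \
    --instance-id <instance-id> \
    --iam-instance-profile Name=<profile-name>

# Confirm the association took effect.
aws ec2 describe-iam-instance-profile-associations \
    --filters Name=instance-id,Values=<instance-id>
```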
Generating an AWS Access Key ID & AWS Secret Access Key pair for an AWS role that has the privileges to access the source S3 bucket.
Access keys are long-term credentials generated for an IAM user or role. These keys can be used for programmatic requests to the AWS CLI or AWS API (directly or using the AWS SDK). For more information, refer to the AWS Documentation.
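As a sketch of the key-generation step, assuming the keys belong to an IAM user (<profiler-user> is a hypothetical name; role credentials would instead be obtained via STS):

```
# Create a long-term access key pair for the IAM user.
# The response contains AccessKeyId and SecretAccessKey; store them securely.
aws iam create-access-key --user-name <profiler-user>
```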
The profiler also supports Hadoop as a source location. There are two ways to run the profiler with Hadoop as a source:
HADOOP-FS: With this approach, the profiler accesses the HDFS path (given in connector_info) with Kerberos authentication and scans all Parquet files in the given HDFS path.
HADOOP-DB: With this approach, the profiler accesses the Hive database with Kerberos authentication. You have to provide a list of tables while creating the profiler task. The profiler scans the given tables using the Hadoop Hive database details given in connector_info.
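Both modes authenticate with Kerberos, so it is worth verifying the keytab and HDFS access up front. A minimal check, assuming the Hadoop client tools and a valid krb5.conf are available on the host (the principal, realm, and path are placeholders):

```
# Obtain a Kerberos ticket using the keytab.
kinit -kt /path/to/hadoop.keytab <principal>@<REALM>

# Confirm the ticket was granted.
klist

# Verify that the HDFS path from connector_info is readable.
hdfs dfs -ls /path/to/parquet/files
```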
Set up the Hyperscale File Connector and add the required ConnectorInfo details.
Procedure
Untar the profiler downloaded from Delphix’s download page. It should contain the docker images for the profiler and the docker-compose.yaml file to run the profiler.

```
tar -xf parquet-profiler.tar.gz
```
Load the delphix-hyperscale-profiler-api and delphix-hyperscale-profiler-backend docker images.

```
docker load --input delphix-hyperscale-parquet-profiler-api.tar
docker load --input delphix-hyperscale-parquet-profiler-backend.tar
```
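Before moving on, you can confirm the images are present:

```
# List the loaded profiler images.
docker images | grep profiler
```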
Edit the docker-compose YAML file to map the controller endpoint for the delphix-hyperscale-profiler-api to interact with.

```
services:
  ...
  profiler-api-service:
    ...
    environment:
      ...
      - CONTROLLER_URL=https://<controller-ip>/api
```
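It can help to confirm the controller endpoint is reachable from this host before starting the services; a quick check (the -k flag skips TLS verification and is only appropriate if the controller presents a self-signed certificate):

```
# Any HTTP response (rather than a connection error) indicates reachability.
curl -k https://<controller-ip>/api
```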
(Optional) You can also provide the AWS access keys as environment variables; they will be used as the default credentials to access the source S3 location.

```
services:
  ...
  profiler-api-service:
    ...
    environment:
      ...
      - AWS_ACCESS_KEY_ID=<access_key_id>
      - AWS_SECRET_ACCESS_KEY=<secret_access_key>
      - AWS_DEFAULT_REGION=<region>
```
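To sanity-check the key pair before wiring it into the compose file, you can ask AWS which identity it resolves to (run from any machine with the AWS CLI, using the same three variables):

```
export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_access_key>
export AWS_DEFAULT_REGION=<region>

# Returns the account and ARN the keys belong to.
aws sts get-caller-identity
```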
To profile data that is present on Hadoop, you have to provide the core-site.xml, krb5.conf, and hadoop.keytab files as volume mounts.

```
services:
  ...
  profiler-api-service:
    ...
    volumes:
      ...
      - /path/to/keytab_file/hadoop.keytab:/app/hadoop.keytab
      - /path/to/hadoop/core-site.xml:/app/hadoop/etc/hadoop/core-site.xml
      - /path/to/etc/krb5.conf:/etc/krb5.conf
```
Start the profiler service.

```
docker-compose up -d
```
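Once started, you can check that the containers are running and tail the API logs if anything looks off (the service name is taken from the compose file above):

```
# Show the state of the profiler services.
docker-compose ps

# Follow the API service logs.
docker-compose logs -f profiler-api-service
```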
Access the profiler Swagger UI at http://<host-ip>:8888.
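If the UI does not load, a quick reachability check from a terminal can narrow things down (an HTTP status code, rather than a connection error, means the service is listening):

```
curl -s -o /dev/null -w "%{http_code}\n" http://<host-ip>:8888
```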