Redshift Connector

Amazon Redshift is an Internet hosting service and data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services.

Connector Configuration

Redshift requires the following parameters:

host - a Redshift endpoint
user-name - specifies the name of the user's account (provided by Redshift);
password - specifies the password for the specified user;
db - specifies the default database to use once connected.

Example:

call SYSADMIN.createConnection(name => 'redshift', jbossCLITemplateName => 'redshift', connectionOrResourceAdapterProperties => 'host=<host>,port=5439,db=<database>,user-name=<user_name>,password=<password>') ;;
call SYSADMIN.createDataSource(name => 'redshift', translator => 'redshift', modelProperties => 'importer.schemaPattern=test_nk,importer.useFullSchemaName=FALSE,importer.tableTypes="TABLE,VIEW"', translatorProperties => 'supportsNativeQueries=true') ;;

SQL

Proxy settings for Amazon S3 upload can be configured via System Properties.

Proxy settings for Amazon S3 available since 2.1.7

Translator Properties

Loading data using Amazon AWS S3 (S3LOAD) can be configured and is highly recommended for any productive usage, as inserting data into Redshift using standard JDBC protocol can be extremely slow.

The following translator parameters are required to configure S3LOAD

uploadMode=s3load - explicitly specifies S3LOAD mode;
region - AWS S3 region endpoint;
bucketName - optional, specifies a bucket name to upload data files to;
bucketPrefix - optional, specifies a prefix of the temporary bucket to upload data files to if bucketName is not specified; must comply with Amazon S3 bucket naming convention (nb: 36 characters would be added to the bucket prefix when creating a temporary bucket);
createBucket - optional, specifies if the bucket set in bucketName parameter should be created if it does not exist, the default value is false;
keyId - AWS S3 key ID;
secretKey - AWS S3 secret key.

Example:

call SYSADMIN.createConnection(name => 'redshift', jbossCLITemplateName => 'redshift', connectionOrResourceAdapterProperties => 'host=<host>,port=5439,db=<database>,user-name=<user_name>,password=<password>') ;;
call SYSADMIN.createDataSource(name => 'redshift', translator => 'redshift', modelProperties => 'importer.schemaPattern=test_nk,importer.useFullSchemaName=FALSE,importer.tableTypes="TABLE,VIEW"', translatorProperties => 'varcharReserveAdditionalSpacePercent=300,supportsNativeQueries=true,uploadMode=s3Load,region=<region>,bucketName=<bucket_name>,createBucket=true,keyId=<key_ID>,secretKey="<secret_key>"') ;;

SQL

bucketPrefix translator property is available since 2.1.7

createBucket translator property is available since 2.1.15

Distribution and Sort Keys

Redshift does not support indexes but supports distribution and sort keys that can be used to improve the performance of queries.

With respect to indexes, distkeys and sortkeys must be defined when the table is created.

Sort Keys

SORTKEYs are created analyzing the currently recommended indexes collected for each optimization.

According to the documentation, SORTKEYs can be specified both at column and table levels. It is possible to specify only one SORTKEY column (at column level) or multiple columns if defined at the table level.
Since it is possible to specify only one SORTKEY(with one or more columns) at the table level, we decided to create a SORTKEY corresponding to the recommended index (with kind SINGLE or MULTIPLE) with the highest frequency. The system will create then a SORTKEY with one column or with multiple columns if the highest freq index is SINGLE or MULTIPLE, respectively.

Columns that are normally recommended for index creation are used to define dist and sort keys.

Dist Keys

DISTKEYs are not automatically recommended by the system and they need to be manually created by the user.

It is not possible to specify more than one DISTKEY for each recommended optimization.

Please, note that the IndexType of a DISTKEY must be set to MANUAL (which is the default, so specifying IndexType may be just skipped).

Distribution Style

Defines the data distribution style for the whole table. Amazon Redshift distributes the rows of a table to the compute nodes according to the distribution style specified for the table.

The distribution style that you select for tables affects the overall performance of your database.

EVEN: The data in the table is spread evenly across the nodes in a cluster in a round-robin distribution. Row IDs are used to determine the distribution, and roughly the same number of rows are distributed to each node. This is the default distribution method.
KEY: The data is distributed by the values in the DISTKEY column. When you set the joining columns of joining tables as distribution keys, the joining rows from both tables are collocated on the compute nodes. When data is collocated, the optimizer can perform joins more efficiently. If you specify DISTSTYLE KEY, you must name a DISTKEY column.
ALL: A copy of the entire table is distributed to every node. This distribution style ensures that all the rows required for any join are available on every node, but it multiplies storage requirements and increases the load and maintenance times for the table. ALL distribution can improve execution time when used with certain dimension tables where KEY distribution is not appropriate, but performance improvements must be weighed against maintenance costs.

Internal

SORTKEY and DISTKEY created for a table in Redshift can be checked with a query like this (to be executed directly on Redshift). The schema, that contains the table, has to be in the search path.

select tablename, "column", type, encoding, distkey, sortkey from pg_table_def where tablename like 'mat_table%';

SQL

Usage Examples

SORTKEY, DISTKEY and DISTSTYLE are passed as OPTIONS. They are added to CREATE TABLE respectively SELECT INTO-command as shown below:

CREATE TABLE source.table_name (id integer, name varchar(255))
OPTIONS (DISTKEY 'id', SORTKEY 'id,name') ;;
 
SELECT * INTO target.table_name FROM source.table_name
OPTIONS (SORTKEY 'id', DISTSTYLE 'EVEN') ;;

SQL

Redshift Spectrum

The Redshift JDBC driver exposes Spectrum tables as EXTERNAL TABLE. In order to have your Redshift data source list Spectrum tables, adjust the data source parameter importer.tableTypes accordingly by specifying e.g. importer.tableTypes="TABLE,VIEW,EXTERNAL TABLE".