
Connectors for Cloud Storage

This page contains all information about the connectors for cloud storage.


Amazon Redshift - Fast Load#

Use the Amazon Redshift (Fast load) connector as a fast export method: data is first loaded onto your S3 server and then copied into your Redshift database. During export, the S3 server is used only to stage temporary files, which are removed once the export job completes. If an export job fails, these temporary files may remain on the S3 server.
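For illustration only, the staging pattern the connector automates looks roughly like the following Python sketch. It assumes the boto3 and psycopg2 packages; the bucket, cluster, table, and credential values are placeholders, and this is not the connector's internal code.

```python
import boto3
import psycopg2

# Stage the export file in S3; the connector does this with temporary files.
s3 = boto3.client("s3")
s3.upload_file("export.csv", "my-staging-bucket", "tmp/export.csv")

# Bulk-load the staged file into Redshift with a COPY statement.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="spectrum", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY target_table
        FROM 's3://my-staging-bucket/tmp/export.csv'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        CSV;
    """)

# Remove the temporary staging file, as the connector does after a
# successful export job.
s3.delete_object(Bucket="my-staging-bucket", Key="tmp/export.csv")
```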

INFO

To use this connector, you must first download the following JAR file: Redshift JDBC 4.1

This JAR file must then be added to your custom-jars folder, located at:

<Your path to Spectrum>/modules/dap-conductor/build/<Spectrum Version>/etc/custom-jars

Configuring Amazon Redshift (Fast Load) as a Connection#

To configure an Amazon Redshift - Fast Load connection:

  1. Select "+" and then select "Connection", or right-click on a folder in the File Browser and select "Create New" and then "Connection". A 'New Connection' tab opens.

  2. Select "Amazon Redshift (Fast load)" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the Amazon Redshift credentials and S3 credentials.

INFO: The connection usage setting is only available for export jobs.

  4. If required, add a description and click "Save".

Azure Data Lake Storage Gen 2#

INFO

Azure Data Lake Storage Gen 2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Data Lake Storage Gen2 is the result of converging the capabilities of the two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1.

Preparing the Azure Data Lake Storage#
INFO

Prepare your Azure Data Lake Storage instance in your account under https://portal.azure.com/. Find more information about the preparation in the Azure documentation.

Having the Plug-In Installed#
INFO

The plug-in 'Azure Blob Storage' must be installed in the 'Admin' tab. This comes automatically with your Spectrum distribution.

Configuring Azure Data Lake Storage Gen 2 as a Connection#

To configure Azure Data Lake Storage Gen 2 as a connection:

  1. Select "+" and then select "Connection", or right-click on a folder in the File Browser and select "Create New" and then "Connection". A 'New Connection' tab opens.

  2. Select "Azure Blob Storage" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the "storage account" name.

  4. Enter the name of the storage container.

  5. Enter the "storage access key".

  6. Enter the "root path prefix" (see the sketch after these steps).

  7. Select your data transfer channel from the drop-down.

  8. Select if you want to use the connection for import, export or both and confirm with "Next". The 'Save Connection' tab opens.

  9. If needed, enter a description and confirm with "Next". The 'Save Connection' dialog opens.

  10. Select the folder to save the connection, enter a name in "Save as" and confirm with "Save". The connection is saved. Configuring the Azure Data Lake Storage Gen 2 connection is finished.
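For reference, the values entered in steps 3 to 6 correspond one-to-one to the parameters of the standard Azure SDK. A minimal Python sketch with the azure-storage-blob package (not part of Spectrum; the account, key, container, and prefix values are placeholders):

```python
from azure.storage.blob import BlobServiceClient

# Storage account name (step 3) and storage access key (step 5).
service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential="<storage access key>",
)

# Storage container (step 4) and root path prefix (step 6).
container = service.get_container_client("mycontainer")
for blob in container.list_blobs(name_starts_with="spectrum/data/"):
    print(blob.name)
```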


Google Cloud Storage (GCS)#

Configuring Google Cloud Storage as a Connection#

To configure a Google Cloud Storage connection:

  1. Select "+" and then select "Connection", or right-click on a folder in the File Browser and select "Create New" and then "Connection". A 'New Connection' tab opens.

  2. Select "Google Cloud Storage" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Option 1: Select "IAM Authentication" as the authentication method.

INFO: 'IAM Authentication' is the default authentication mode. No further information is necessary because it is set up in Google Cloud console.

  4. Option 2: Select "Service Account Key Authentication" as the authentication method (illustrated by the sketch after these steps).

  5. Add your 'Google Service Account private key file' as a JSON file by clicking on "Upload". Your File Browser opens.

  6. Select the JSON file and confirm. The data path is shown next to the upload button.

  7. Add the Google Cloud Storage bucket that will be used for the connection.

INFO: Note that a Google Cloud Storage bucket name must not contain "_".

  8. Select the connection usage "Import/Export" from the drop-down list and click "Next". You will be guided to the 'Save' settings.

  9. If required, add a description and click "Save". The dialog 'Save Connection' opens.

  10. Select the place to save the connection under "Data" > "Connections", name the connection, and confirm with "Save". The connection is shown in the File Browser.
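For illustration, the 'Service Account Key Authentication' option matches the standard key-file authentication of the google-cloud-storage Python package. A minimal sketch with placeholder file and bucket names (not the connector's internal code):

```python
from google.cloud import storage

# Build a client from the uploaded service account key JSON (steps 4-6).
client = storage.Client.from_service_account_json("service-account-key.json")

# Access the bucket entered in step 7; the name must not contain "_".
bucket = client.bucket("my-spectrum-bucket")
for blob in bucket.list_blobs(max_results=10):
    print(blob.name)
```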


S3 Native#

Spectrum's Amazon S3 Native connector uses multipart upload technology to boost performance. If you are using the older S3 connector, consider switching to the S3 Native connector. The S3 Native connector can export to S3 buckets on which Spectrum cannot call the getObjectMetadata() method, whereas the older S3 connector requires getObjectMetadata() access in order to read from and write to S3 buckets.

Configuring S3 Native as a Connection#

To configure an S3 Native connection:

  1. Select "+" and then select "Connection", or right-click on a folder in the File Browser and select "Create New" and then "Connection". A 'New Connection' tab opens.

  2. Select "S2 Native" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Select the authentication process.

    If using an access key and secret key, enter those below.

    If using IAM, Spectrum's S3 client will use the instance profile credentials to sign and authenticate the S3 requests.

    Select an encryption algorithm option, enter the AWS region, and enter the S3 bucket name.

  4. Enter the uploading parameters.

    Planning for memory usage:

    By default, Spectrum starts 3 threads per stream, and each stream has a queue capacity of 3 tasks. Each part (the buffer size) is set to 10 MB. The result consumes 3 × 3 × 10 MB = 90 MB of RAM.

    Size planning:

    Amazon limits any multipart upload job to 10,000 parts. As a result, each Spectrum task can upload at most a 100 GB file (10,000 × 10 MB = 100 GB) using the default settings. Increase the part size setting if you have a larger file. Note that this limit is per Spectrum task: if you have five export tasks, the original worksheet data is divided into five parts, each with its own 100 GB limit. (A small calculator reproducing this arithmetic follows the steps below.)

    INFO: These settings can be overridden in each export job.

  5. If required, add a description and click "Save".
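The memory and size figures above follow directly from three parameters: threads per stream, queue capacity, and part size. A small illustrative Python helper that reproduces the arithmetic (using 1 GB = 1000 MB, as the text does; not Spectrum code):

```python
import math

def multipart_plan(threads=3, queue_capacity=3, part_size_mb=10,
                   max_parts=10_000, target_file_gb=None):
    """Reproduce the default multipart sizing arithmetic."""
    # RAM held by in-flight buffers: threads x queue capacity x part size.
    ram_mb = threads * queue_capacity * part_size_mb      # 3 * 3 * 10 MB = 90 MB
    # Largest single upload per task: Amazon's part limit x part size.
    max_file_gb = max_parts * part_size_mb / 1000         # 10,000 * 10 MB = 100 GB
    plan = {"ram_mb": ram_mb, "max_file_gb": max_file_gb}
    if target_file_gb is not None:
        # Minimum part size needed to keep a larger file under the part limit.
        plan["min_part_size_mb"] = math.ceil(target_file_gb * 1000 / max_parts)
    return plan

print(multipart_plan())                     # {'ram_mb': 90, 'max_file_gb': 100.0}
print(multipart_plan(target_file_gb=500))   # a 500 GB file needs >= 50 MB parts
```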


Snowflake#

Configuring Snowflake as a Connection#

To create a Snowflake connection:

  1. Select "+" and then select "Connection", or right-click on a folder in the File Browser and select "Create New" and then "Connection". A 'New Connection' tab opens.

  2. Select "Snowflake" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the Snowflake URL and your Snowflake credentials.

  4. Choose the storage export environment. Mark either "AWS Simple Storage Service" or "Azure Blob Storage".

INFO: The data is first exported to S3 or Blob Storage and then moved to Snowflake (see the sketch after these steps).

INFO: For S3, enter the access key, access secret, your region, and the bucket.

INFO: For Blob Storage, enter the account, the container, and the credentials.

  5. Confirm with "Next". The section 'Save' opens.

  6. If needed, enter a description and confirm with "Save". The dialog 'Save Connection' opens.

  7. Select the name and storage location of the connection and confirm with "Save". The Snowflake connection is saved.
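For illustration, the staging flow described in the INFO notes above follows Snowflake's standard bulk-load pattern. A minimal Python sketch with the snowflake-connector-python package (the account, credentials, bucket, and table names are placeholders, and this is not the connector's internal code):

```python
import snowflake.connector

# Connect with the Snowflake URL and credentials from step 3.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="spectrum", password="...",
    warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
)

# Data is first staged in S3 (or Blob Storage), then copied into Snowflake.
conn.cursor().execute("""
    COPY INTO target_table
    FROM 's3://my-staging-bucket/tmp/'
    CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...')
    FILE_FORMAT = (TYPE = CSV)
""")
conn.close()
```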