Connectors for Files#

INFO

Find here all information about connectors for files.

Apache Knox WebHDFS#

INFO

The connector can be disabled in the 'Admin' tab.

This connection is needed to access an HDFS that is not the HDFS where Spectrum's Private Folder is stored, e.g. in a hybrid cloud or a second cluster.

Configuring WebHDFS as a Connection#

To connect to Apache Knox WebHDFS:

  1. Select "+" and then select "Connection", or right-click on a folder in the File Browser and select "Create New" and then "Connection". A 'New Connection' tab opens.

  2. Select "Apache Knox secure WebHDFS" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the "Apache Knox gateway hostname" and the "Port".

  4. Enter the "Topology" path.

INFO: Entering "/" leads to the default topology path.

  5. Select "SSL with user provided truststore" from the drop-down, upload the "Truststore" file with the Apache Knox SSL certificate, and enter the password for the truststore.

INFO: Leave the field blank if the truststore is not password protected.

  6. Select "Basic (LDAP) - username/password authentication" from the drop-down, enter your credentials, and confirm with "Next". The section 'Save' opens.

  7. If needed, enter a description and confirm with "Save". The dialog 'Save Connection' opens.

  8. Enter a name and select the storage location of the connection, then confirm with "Save". The Knox WebHDFS connection is saved.
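
Under the hood, Knox proxies WebHDFS REST calls through its gateway, and the connection details above map onto a URL with a fixed pattern. A minimal sketch of that pattern (the hostname, port, and topology below are placeholder values, not product defaults):

```python
# Sketch of the WebHDFS-over-Knox URL pattern. Host, port, and
# topology are placeholder values used only for illustration.
def knox_webhdfs_url(host, port, topology, path, op="LISTSTATUS"):
    """Build the gateway URL Knox uses to proxy a WebHDFS operation."""
    return f"https://{host}:{port}/gateway/{topology}/webhdfs/v1{path}?op={op}"

# Example: list the contents of /data through the 'default' topology.
print(knox_webhdfs_url("knox.example.com", 8443, "default", "/data"))
```

This is why step 4 asks for the topology path: it selects which Knox topology (and therefore which backing cluster) the gateway routes the request to.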

Disabling the Apache Knox WebHDFS Plug-In#

To disable the Apache Knox WebHDFS plug-in:

  1. Open the "Admin" tab and select "Plug-ins". The plug-in overview page opens.

  2. Click the "Disable Extension" icon of the extension "Apache Knox secure WebHDFS". The extension is disabled.

INFO: To enable the extension, click on the icon again.

Azure Blob Storage#

INFO

This connector is only available as a separate plug-in for distributions based on Apache Hadoop 2.7.0+.

Configuring Azure Blob Storage as a Connection#

To connect to Azure Blob Storage:

  1. Click the "+" button and select "Connection" or right-click in the browser and select "Create new" and then "Connection".

  2. Click on the drop-down list and choose "Azure Blob Storage" as the connection type.

  3. Click "Next".

  4. In the 'Storage Account Name' field, enter the name of your Azure Blob Storage account.

  5. Add the "container name" you selected when creating your storage.

  6. Enter your "access key" and any necessary "root path prefixes".

  7. Select whether the connection should be used for import, export, or both.

  8. Add a description and click "Save".

  9. Give your Azure Blob Storage connection a name and confirm with "Save".

Importing Data with the Azure Blob Storage Connector#

To import with an Azure Blob Storage connector:

  1. Click the "+" button and select "Import Job" or right-click in the browser and select "Create new" and then "Import Job".

  2. Click "Select Connection" and choose the name of your connection.

  3. Enter the path for a file or folder. A preview of the imported data is displayed.

  4. Review the schema.

  5. Review the schedule, data retention, and advanced properties for the job.

  6. Add a description, click "Save", and name the file.

Custom Protocol#

Configuring a Custom Protocol as a Connection#

To configure a Custom Protocol connection:

  1. Click the "+" button and select "Connection" or right-click in the browser and select "Create new" and then "Connection".

  2. From the drop-down list, select "Custom Protocol (including http/https)" as the connection type.

  3. Click "Next".

  4. Enter the "Base URL" to connect to the supported file system.

INFO: You can also enter a "User" and "Password"; they are inserted into the URL automatically.

  5. Enter any custom properties for the file system you are connecting to and select whether to use the connection for imports, exports, or both.

  6. Click "Next".

  7. If required, add a description and click "Save".

Importing Data with a Custom Protocol Connector#

To import with a Custom Protocol Connector:

  1. Click the "+" button and select "Import Job" or right-click in the browser and select "Create new" and then "Import Job".

  2. Click "Select Connection" and choose the name of your custom protocol connection and then click "Next".

  3. Enter the "File" path for the custom protocol, add the "XML Record Tag Name" to choose which record to extract, and add necessary "Field XPaths".

  4. Click "Next". A preview of the imported data is displayed.

  5. Review the schema.

  6. Review the schedule, data retention, and advanced properties for the job.

  7. Add a description, click "Save", and name the file.
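
For illustration, the record tag and field XPaths in step 3 behave much like plain XML parsing: the record tag selects the repeating elements, and each field XPath is evaluated relative to one record. A minimal sketch with Python's standard library (the tag and field names are made-up examples, not connector defaults):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: 'order' plays the role of the
# XML Record Tag Name, 'id' and 'total' the Field XPaths.
doc = """<orders>
  <order><id>1</id><total>9.99</total></order>
  <order><id>2</id><total>4.50</total></order>
</orders>"""

root = ET.fromstring(doc)
records = [
    {field: order.findtext(field) for field in ("id", "total")}
    for order in root.iter("order")  # one dict per record tag
]
print(records)  # [{'id': '1', 'total': '9.99'}, {'id': '2', 'total': '4.50'}]
```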

HDFS#

Configuring HDFS as a Connection#

To configure an HDFS connection:

  1. Click the "+" button and select "Connection" or right-click in the browser and select "Create new" and then "Connection".

  2. Click on the drop-down list and choose "HDFS" as the connection type.

  3. Click "Next".

  4. Enter the "host address" for the HDFS server and the "port number".

  5. Add the "root path prefix" of the directory from which to import data.

  6. If needed, add custom properties.

  7. Add a description and click "Save".

HDFS High Availability#

HDFS offers high-availability capabilities that allow the main metadata server (the NameNode) to be failed over manually to a standby NameNode in the event of a failure.

To use the HDFS HA (High Availability) feature with the connector, follow the setup instructions above with two adjustments:

  1. Use a logical namenode name in the hostname. Keep the port field empty.
  2. Add the custom properties:

    dfs.nameservices=nnmain
    dfs.ha.namenodes.nnmain=nn0,nn1
    dfs.namenode.rpc-address.nnmain.nn0=ip-<host-name>:<port>
    dfs.namenode.rpc-address.nnmain.nn1=ip-<host-name>:<port>
    dfs.client.failover.proxy.provider.nnmain=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
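
Since the property set above follows a fixed pattern, it can be generated from the nameservice ID and the NameNode RPC addresses. A small sketch (the nameservice name and addresses below are placeholder values):

```python
def hdfs_ha_properties(nameservice, namenodes):
    """Build the client-side HDFS HA custom properties.

    namenodes maps a logical NameNode id (e.g. 'nn0') to its
    'host:port' RPC address; all values here are placeholders.
    """
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = address
    return props

for key, value in hdfs_ha_properties(
        "nnmain", {"nn0": "host-a:8020", "nn1": "host-b:8020"}).items():
    print(f"{key}={value}")
```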
    
Importing Data with the HDFS Connector#

To import from an HDFS connection:

  1. Click the "+" button and select "Import Job" or right-click in the browser and select "Create new" and then "Import Job".

  2. Click "Select Connection" and choose the name of your HDFS connection.

  3. Enter the path for a file or folder on the HDFS.

INFO: Spectrum has added the Remote Data Browser feature. This File Browser gives you a visual interface to select the file to import from your HDFS. The filter box at the top of the Remote Data Browser can be used to find folders/files within the current directory.

  4. A preview of the imported data is displayed. Review the schema.

  5. Review the schedule, data retention, and advanced properties for the job.

  6. Add a description, click "Save", and name the file.

Amazon S3#

Prerequisites#

Spectrum needs the appropriate access level to establish a connection to an S3 bucket. This requires the 'ListBucket' permission on the bucket's root folder; if the permission is granted only on a subfolder, the connection attempt fails.

Usable Custom Properties#

Disabling the IAM Role Authentication Option

Disabling the IAM role mode enforces stricter access control to S3 buckets by requiring the 'IAM with assumed role' mode instead. In that mode, data access happens under an assumed AWS IAM role explicitly defined in the S3 connection.

To disable the IAM role option, enter the following custom property in the 'Custom Properties' settings on the Admin page:

'das.s3.iam.without.assumed.role.enabled=true'

Configuring Amazon S3 as a Connection#

To configure an Amazon S3 connection:

  1. Click the "+" button and select "Connection" or right-click in the File Browser and select "Create New" and then "Connection". The 'New Connection' tab appears in the menu bar.

  2. Select "S3" from the drop-down and confirm with "Next". The type is displayed in the drop-down.

  3. Enter the "S3 Bucket".

  4. Select the encryption algorithm.

  5. Select how to authenticate. The details below adapt to the selection.

    INFO: Selecting 'IAM role' will use the IAM role associated with the EC2 instance or the EMR cluster. Selecting the 'IAM assumed role' will use the IAM role associated with the EC2 instance or the EMR cluster to assume a specific role.

  6. If you selected 'Access key and secret', enter the "Access key", the "Access secret" (if needed), the "Root path prefix", the "Region", and the "S3 Endpoint", then confirm with "Next". The "Save" tab opens.

  7. If needed, enter a description, and confirm with "Next". The 'Save Connection' dialog opens.

  8. Select the folder to save the connection in, enter a name under 'Save as', and confirm with "Save". The connection is saved.

Importing Data with an S3 Connector#
INFO

This connector cannot import from S3 buckets that do not allow reading object metadata via the getObjectMetadata() method.
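
For context, getObjectMetadata() corresponds to an HTTP HEAD request on the object, so the bucket policy must allow that request for the import to work. A rough sketch of the request shape (the bucket, region, and key below are placeholder values):

```python
# Sketch of the HTTPS HEAD request that reading object metadata maps to.
# Bucket, region, and key are placeholder values for illustration.
def s3_head_request(bucket, region, key):
    """Return the method and virtual-hosted-style URL for an object HEAD."""
    return "HEAD", f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

method, url = s3_head_request("my-bucket", "eu-central-1", "data/input.csv")
print(method, url)
```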

To import data with an S3 connector:

  1. Click the "+" button and select "Import Job" or right-click in the browser and select "Create new" and then "Import Job".

  2. Click "Select Connection" and choose the name of your S3 connection and then click "Next".

  3. Enter the file or folder path and the delimiter.

    INFO: Spectrum has added the Remote Data Browser feature. This File Browser gives you a visual interface to select the file to import from your S3 bucket. The filter box at the top of the Remote Data Browser can be used to find folders/files within the current directory.

  4. Click "Next".

  5. A preview of the imported data is displayed. Review the schema.

  6. Review the schedule, data retention, and advanced properties for the job.

  7. Add a description, click "Save", and name the file.

SFTP#

INFO

Secure File Transfer Protocol (SFTP) is a network protocol that provides file access, file transfer, and file management over any reliable data stream.

Prerequisites#
INFO

Set up the required authentication mode for SSH/SFTP within the 'ssh_config' and 'sshd_config' files in your environment.

Configuring SFTP as a Connection#

To configure an SFTP connection:

  1. Click the "+" button and select "Connection" or right-click in the browser and select "Create new" and then "Connection".

  2. From the drop-down list, select "SFTP" as the connection type.

  3. Enter the SFTP host name(s), the port number, authentication credentials, ssh key (if needed), and the root path prefix.

  4. Select whether the connection should be used for import, export, or both.

  5. If required, add a description and click "Save".

Importing Data with an SFTP Connector#

To import from SFTP:

  1. Click the "+" button and select "Import Job" or right-click in the browser and select "Create new" and then "Import Job".

  2. Click "Select Connection" and choose the name of your SFTP connection.

  3. Choose what type of file to import and then click "Next".

  4. Enter the path to the file or folder you want to access and set the delimiter.

INFO: Spectrum has added the Remote Data Browser feature. This File Browser gives you a visual interface to select the file to import from your SFTP connection. The filter box at the top of the Remote Data Browser can be used to find folders/files within the current directory.

  5. A preview of the imported data is displayed. Review the schema.

  6. Review the schedule, data retention, and advanced properties for the job.

  7. Add a description, click "Save", and name the file.

SSH#

Configuring SSH as a Connection#

To configure an SSH connection:

  1. Click the "+" button and select "Connection" or right-click in the browser and select "Create new" and then "Connection".

  2. From the drop-down list, select "SSH" as the connection type.

  3. Enter your "host name" and "port number", if necessary.

  4. Select whether credentials should be provided or asked for.

INFO: If you choose the former, add the required user name and password.

  5. Add the root path prefix or SSH key.

INFO: The SSH key has a specific format.

The key must begin with:

    -----BEGIN RSA PRIVATE KEY-----

and end with the following line:

    -----END RSA PRIVATE KEY-----

  6. Indicate whether the connection should be used for import, export, or both.

  7. Click "Next".

  8. If required, add a description and click "Save".
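
Before saving the connection, the key markers can be sanity-checked. A minimal sketch that validates only the standard PEM BEGIN/END lines, not the key material itself:

```python
def looks_like_rsa_private_key(text):
    """Check for the standard PEM BEGIN/END markers of an RSA private key."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return (bool(lines)
            and lines[0] == "-----BEGIN RSA PRIVATE KEY-----"
            and lines[-1] == "-----END RSA PRIVATE KEY-----")

# 'MIIE...' stands in for the base64-encoded key body.
sample = ("-----BEGIN RSA PRIVATE KEY-----\n"
          "MIIE...\n"
          "-----END RSA PRIVATE KEY-----")
print(looks_like_rsa_private_key(sample))  # True
```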

Importing Data with an SSH Connector#

To import from an SSH connection:

  1. Click the "+" button and select "Import Job" or right-click in the browser and select "Create new" and then "Import Job".

  2. Click "Select Connection" and choose the name of your SSH connection.

  3. Enter the file or folder path and the delimiter.

INFO: Spectrum has added the Remote Data Browser feature. This File Browser gives you a visual interface to select the file to import from your SSH connection. The filter box at the top of the Remote Data Browser can be used to find folders/files within the current directory.

  4. A preview of the imported data is displayed. Review the schema.

  5. Review the schedule, data retention, and advanced properties for the job.

  6. Add a description, click "Save", and name the file.