
Other Connectors#


Spectrum Spotlight#

INFO

Spectrum Spotlight gives organizations fast access to and deep visibility into all of their enterprise data assets - whether in the cloud or on-premises - via a single unified self-service platform. With Spectrum Spotlight, business teams can discover, access, collaborate on and analyze more data for faster, more trusted cloud analytics while eliminating complex data movement and maintaining strong governance.

Spectrum Prerequisites#

To connect to Spectrum Spotlight, the following prerequisites must be fulfilled:

  • your Spectrum cluster must run in grid mode
  • your Hadoop cluster/ HDFS must be configured to allow access to the data within the HDFS
  • the plug-in 'Datameer DWH API' (including two extensions) must be installed (INFO: the plug-in is installed by default in Spectrum)
  • a job must have been run at least once in Spectrum and its results must be kept
  • any data you wish to access through Spotlight must be in the 'Parquet' format
  • your user must have at least the role 'ANALYST' to access the data:
    • read-only access to the Spectrum folders and files
Spectrum Running on Hadoop/ HDFS#

The following applies for running Spectrum on Hadoop/ HDFS:

  1. the Spectrum instance must be reachable, which requires:
    1. the REST API endpoint URL
    2. the user name
    3. the password
  2. the Hadoop/ HDFS instance must be reachable:
    1. the Hadoop NameNodes
    2. the Hadoop DataNodes
  3. this includes being able to resolve hostnames (a quick connectivity check is sketched after this list)
  4. all ports need to be accessible (note that the ports might have been changed by your in-house administrators for security reasons, and other ports might be valid for specific Hadoop vendors; see also the Hadoop vendor documentation):
    1. Hadoop NameNode client port: normally 8020, 9000 or 54310
    2. Hadoop DataNodes: normally 50010 and 50020
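
The following is a minimal sketch of such a connectivity check, verifying hostname resolution and port access from the Spectrum host (host names and ports are placeholders; substitute your own values):

    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class HadoopReachabilityCheck {
        // Placeholder endpoints -- substitute your own hosts and ports.
        private static final String[][] ENDPOINTS = {
            {"namenode.example.com", "8020"},   // NameNode client port
            {"datanode1.example.com", "50010"}, // DataNode data transfer port
            {"datanode1.example.com", "50020"}  // DataNode IPC port
        };

        public static void main(String[] args) {
            for (String[] endpoint : ENDPOINTS) {
                String host = endpoint[0];
                int port = Integer.parseInt(endpoint[1]);
                try {
                    // Verify that the hostname resolves...
                    InetAddress address = InetAddress.getByName(host);
                    // ...and that the port accepts TCP connections.
                    try (Socket socket = new Socket()) {
                        socket.connect(new InetSocketAddress(address, port), 5000);
                    }
                    System.out.println(host + ":" + port + " is reachable");
                } catch (Exception e) {
                    System.out.println(host + ":" + port + " FAILED: " + e);
                }
            }
        }
    }
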
Spectrum Running on Amazon EMR/ S3#

The following applies for running Spectrum on EMR/ S3:

  • Spotlight must be able to read the data files at the S3 folder/ file locations provided by the Spectrum DWH API
  • the access key and secret of an account that can read from Spectrum's internal storage bucket must be known and set as connection options when setting up the Spectrum connection (a quick read test is sketched below)
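
Before configuring the connection, a short read test against the bucket can confirm the credentials. A minimal sketch using the AWS SDK for Java (bucket name, region and credentials are placeholders):

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class SpectrumBucketReadCheck {
        public static void main(String[] args) {
            // Placeholder credentials of the account that can read from
            // Spectrum's internal storage bucket.
            BasicAWSCredentials credentials =
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY");

            AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .build();

            // List a few objects to confirm read access to the bucket.
            for (S3ObjectSummary summary :
                    s3.listObjectsV2("spectrum-internal-bucket").getObjectSummaries()) {
                System.out.println(summary.getKey());
            }
        }
    }
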

Google BigQuery#

INFO

BigQuery is Google's fully managed data warehouse for petabyte-scale analytics.

Prerequisites#
  • Have the Avro plug-in installed: to use the Google BigQuery connector, make sure the plug-in 'Avro' is installed.
Configuring Google BigQuery as a Connection#

To configure Google BigQuery as a connector:

  1. Select "+" and then select "Connection" or right-click on a folder in the File Browser and select ""Create New" and then ""Connection". A 'New Connection' tab opens.

  2. Select "Google BigQuery" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Select "Import/Export" from the drop-down 'Connection usage'.

  4. Select "IAM Authentication" from the drop-down 'Select authentication method'.

INFO: 'IAM Authentication' is the default authentication mode. No further information is necessary because it is set up in Google Cloud console.

  5. Enter the dataset for data import.

  6. Enter the Google Cloud Storage URL for data export to BigQuery and confirm with "Next". The tab 'Save' opens.

INFO: The bucket name must contain only lowercase letters, numbers, dashes, underscores and dots. Spaces are not allowed (a quick validation sketch follows these steps).

  7. If needed, enter a description and confirm with "Next". The 'Save Connection' dialog opens.

  8. Select the folder to save the connection, enter a name in "Save as" and confirm with "Save". The connection is saved.
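
The bucket-name rule from step 6 can be sanity-checked with a simple pattern match. A minimal sketch (the bucket names are placeholders):

    import java.util.regex.Pattern;

    public class BucketNameCheck {
        // Lowercase letters, digits, dashes, underscores and dots only; no spaces.
        private static final Pattern VALID = Pattern.compile("[a-z0-9._-]+");

        public static void main(String[] args) {
            System.out.println(VALID.matcher("my_export-bucket.2024").matches()); // true
            System.out.println(VALID.matcher("My Bucket").matches());             // false
        }
    }
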


JDBC HiveServer2#

Note when using Hive 3: When exporting to Hive 3, Spectrum now converts all Spectrum DATE columns to Hive TIMESTAMP and Hive DATE type data, shifting values from the Spectrum server time zone to UTC. On import, Hive TIMESTAMP and Hive DATE values are converted back from UTC to the Spectrum server time zone. Never set the property 'hive.parquet.timestamp.skip.conversion' to 'TRUE', as this would result in inconsistent data being stored in Hive.
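
The effect of this conversion can be illustrated with java.time. The following sketch mirrors the described behavior and is not Spectrum's actual implementation (the server time zone is a placeholder):

    import java.time.LocalDateTime;
    import java.time.ZoneId;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;

    public class HiveTimeZoneConversionExample {
        public static void main(String[] args) {
            ZoneId serverZone = ZoneId.of("Europe/Berlin"); // placeholder server time zone

            // Export: interpret the value in the server time zone, store it as UTC.
            LocalDateTime spectrumValue = LocalDateTime.of(2023, 6, 1, 12, 0);
            ZonedDateTime utcValue = spectrumValue.atZone(serverZone)
                                                  .withZoneSameInstant(ZoneOffset.UTC);
            System.out.println("Stored in Hive (UTC):    " + utcValue);  // 10:00 UTC

            // Import: convert the UTC value back to the server time zone.
            ZonedDateTime backAgain = utcValue.withZoneSameInstant(serverZone);
            System.out.println("Read back (server zone): " + backAgain); // 12:00 +02:00
        }
    }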

Hardware Recommendation#

To use the JDBC (HiveServer) connector, Spectrum recommends a Hadoop cluster with at least 16 vCPU and 30 GB RAM.

Property Recommendation#

The following property changes might help to reduce problems from hardware limitations:

  • das.job.map-task.memory=1024
  • das.job.reduce-task.memory=1024
  • das.job.application-manager.memory=1024
Performance Constraints#

Native Hive connectors use a file-based approach. Through these connectors, Spectrum connects to Hive at the very beginning and requests the table description to get the data source schema and its location in HDFS. Spectrum then accesses the data directly from HDFS. This method has an efficient parallel ingestion process and achieves a high performance level. However, this approach prevents the use of custom queries or ingesting data from Hive views.

Spectrum's Hive JDBC connector allows users to benefit from custom queries and ingestion from views, but has performance constraints to be aware of.

What happens when you access Hive via JDBC (e.g., Beeline)? A user creates a query and submits it to the Hive service. The Hive service compiles a batch job (e.g., MapReduce, Tez, Spark) and submits it to a Hadoop cluster for processing. The performance of this job depends on the Hive configuration (execution framework, requested container memory) and the cluster's capacity. As soon as this application is completed, the results are returned to the Beeline client or written to a file if requested. Streaming results from Hive to a client can also start during application processing.
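
Programmatic JDBC access works the same way. A minimal sketch using the standard Hive JDBC driver (host, port, credentials and query are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // The Hive service compiles this query into a batch job
            // (e.g., MapReduce, Tez, Spark) and runs it on the cluster.
            String url = "jdbc:hive2://hive.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")) {
                while (rs.next()) {
                    // Results are streamed back to the client.
                    System.out.println(rs.getString(1));
                }
            }
        }
    }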

Spectrum compiles a batch job and submits it to the cluster when accessing Hive data via a JDBC connection. This job also makes a JDBC connection to Hive and requests Hive to execute the corresponding query (SELECT or a user's custom query). The Hive service compiles and submits its own Yarn application to process this request.

There are several bottlenecks in terms of performance in this process.

  • The performance of the job submitted by Hive depends on available cluster resources, their allocation, and the ability to parallelise the calculation.
  • The performance of the job initially submitted by Spectrum. Because it is difficult and often inefficient to split the requests sent to Hive via JDBC, Spectrum by default uses a single container to run such executions. This means that all data returned by Hive is processed and written to HDFS by a single thread.

When performance is critical for your use case, Spectrum recommends using the native HiveServer2 connector.

Helpful Keystore and Truststore information#

In order to use SSL, you need two pieces: a keystore and a truststore.

Keystore#

A keystore is located on the server side only. This file contains the private key used to decrypt messages. A keystore isn't delivered with the JDK: you have to create your own keystore and either import the key you got from your Certificate Authority (CA) or import a self-generated key.

Truststore#

A truststore is located on the client side (e.g., Spectrum). This file contains:

  • The public part of the private key from the keystore, packaged in a certificate (in the case of self-signed certificates).
  • All other root certificates, called CA certificates. These certificates are used to verify that the server's certificate is valid during the SSL handshake.

You can use the truststore provided by the JDK from Oracle. This file is located under $JAVA_HOME/jre/lib/security/cacerts.
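
The contents of this default truststore can also be inspected programmatically. A minimal sketch (the default cacerts password is 'changeit'):

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.util.Collections;

    public class ListCacerts {
        public static void main(String[] args) throws Exception {
            // 'java.home' points at the JRE directory, so this path finds
            // the cacerts file on both older and newer JDKs.
            String path = System.getProperty("java.home") + "/lib/security/cacerts";
            KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
            try (FileInputStream in = new FileInputStream(path)) {
                trustStore.load(in, "changeit".toCharArray());
            }
            // Print the alias of every trusted CA certificate.
            for (String alias : Collections.list(trustStore.aliases())) {
                System.out.println(alias);
            }
        }
    }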

Hive Server Side#

You can turn on SSL if you want to protect the communication between the Hive Server and any other Hive client. To do so, edit your hive-site.xml and add the following lines (the values are just examples):

  <property>
    <name>hive.server2.use.SSL</name>
    <value>true</value>
    <description>enable/disable SSL</description>
  </property>

  <property>
    <name>hive.server2.keystore.path</name>
    <value>/home/spectrum/hive/security/hive.jks</value>
    <description>path to keystore file</description>
  </property>

  <property>
    <name>hive.server2.keystore.password</name>
    <value>spectrum</value>
    <description>keystore password</description>
  </property>
Hive Client Side#

The client uses a truststore to hold certificates that are trustworthy. You can define your own truststore or use the default truststore provided by Oracle, located under $JAVA_HOME/jre/lib/security/cacerts.

There are two types of certificate you can import into a truststore: a self-signed certificate or a CA-signed certificate.

Self-signed certificate:

These certificates aren't part of the default truststore provided by Oracle under $JAVA_HOME/jre/lib/security/cacerts. If you want to use self-signed certificates for SSL, you must import them into your truststore.
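
For example, a self-signed certificate can be imported with the JDK's keytool (file names and the alias are placeholders):

    keytool -importcert -alias hiveserver -file hiveserver.crt -keystore mytruststore.jks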

CA-signed certificate:

You can buy a certificate from a Certificate Authority (CA). Oracle trusts many different CAs. You can inspect this truststore via the command:

    keytool -list -keystore $JAVA_HOME/jre/lib/security/cacerts

You don't need to import your CA-signed certificate into the JDK truststore if the CA you bought your certificate from is already part of the JDK's truststore.

Configuring JDBC HiveServer2 as a Connection#

To configure a JDBC HiveServer2 connection:

  1. Select "+" and then select "Connection" or right-click on a folder in the File Browser and select ""Create New" and then ""Connection". A 'New Connection' tab opens.

  2. Select "Hive Server2 (JDBC)" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the "Connection string" and enter needed and useful properties, if needed.

  4. Select the "Authentication" type from the drop-down and for that the needed information for "User" and "Password".

  5. If needed, mark the check-box "Enable SSL" and confirm with "Next".
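
A hypothetical connection string for an SSL-enabled HiveServer2 instance could look as follows ('ssl', 'sslTrustStore' and 'trustStorePassword' are standard Hive JDBC URL parameters; host, port and truststore details are placeholders):

    jdbc:hive2://hive.example.com:10000/default;ssl=true;sslTrustStore=/path/to/truststore.jks;trustStorePassword=changeit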


Knox Hive Server2 JDBC#

INFO

This connection is needed when accessing a Hive Server2 JDBC instance through Apache Knox. The corresponding plug-in 'Knox Hive Server2 1.2.1.x' is part of the Hive Server2 plug-in.

The corresponding plug-in can be disabled in the 'Admin' tab.

Configuring Knox Hive Server2 JDBC as a Connection#

To configure a Knox Hive Server2 JDBC connection:

  1. Select "+" and then select "Connection" or right-click on a folder in the File Browser and select ""Create New" and then ""Connection". A 'New Connection' tab opens.

  2. Select "Knox Hive Server2 1.2.1.x" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the "Hive Server2 Connection" JDBC URL.

  4. If needed, enter a database name or pattern to limit the number of databases listed during an import.

  5. If needed, enter custom properties valid for the connection.

  6. Select "SSL with user provided truststore" from the drop-down, upload the "Truststore" file with the Apache Knox SSL certificate and and enter the password for truststore.

INFO: Leave the field blank when the truststore is not password protected.

  1. Enter the "Topology" path.

    INFO: Entering "/" leads to the default topology path.

  2. Select "Basic (LDAP) - username/password authentication" from the drop-down, enter your credentials and confirm with "Next". The section 'Save' opens.

  9. If needed, enter a description and confirm with "Save". The dialog 'Save Connection' opens.

  10. Select the name and storage location of the connection and confirm with "Save". The Knox Hive Server2 JDBC connection is saved.
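
A hypothetical "Hive Server2 Connection" JDBC URL for step 3 could look as follows ('transportMode=http' is the usual setting behind a Knox gateway; host, port and further parameters depend on your Knox setup):

    jdbc:hive2://knox.example.com:8443/;ssl=true;transportMode=http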

Disabling the Knox Hive Server2 Plug-In#

To disable the Knox Hive Server2 plugin:

  1. Open the "Admin" tab and select "Plug-ins". The plug-in overview page opens.

  2. Click the "Disable Extension" icon of the extension "Know Hive Server2 1.2.1.x". The extension is disabled.

    INFO: To enable the extension, click on the icon again.


Power BI#

Power BI is a business analytics service provided by Microsoft. It provides interactive visualizations with self-service business intelligence capabilities, making it easier for business users to create reports and dashboards.

Spectrum uses the Power BI REST API which provides programmatic access to dashboard resources such as datasets, tables, and rows in Power BI.

Power BI Resources#

Go through the following prerequisites and complete any necessary steps before proceeding with the setup on Spectrum. Note that part of setting up Power BI requires Azure admin rights.

Prerequisites#
  • The Spectrum user is part of the relevant Power BI subscription.
  • The Spectrum user can provide the name of Azure Active Directory (Tenant) and the Client ID (Application ID) of the related app needed for authentication and authorization to Power BI.
  • Activate the Azure application for Spectrum by opening the following link in your web browser and inserting your specific values from the Microsoft Azure portal: https://login.microsoftonline.com/TENANT_ID/oauth2/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&response_type=code&prompt=PERMISSION
    • The REDIRECT_URI must be URL-encoded (percent-encoded). Example: "https://login.live.com/oauth20_desktop.srf" is written as "https%3A%2F%2Flogin.live.com%2Foauth20_desktop.srf" (see the sketch after this list).
    • The PERMISSION configuration setting can be set to 'consent' for a normal user or 'admin_consent' for an admin user, giving access to all users in the Active Directory.
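
If in doubt, the redirect URI can be percent-encoded with a few lines of Java. A minimal sketch:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class EncodeRedirectUri {
        public static void main(String[] args) {
            String encoded = URLEncoder.encode(
                "https://login.live.com/oauth20_desktop.srf", StandardCharsets.UTF_8);
            // Prints: https%3A%2F%2Flogin.live.com%2Foauth20_desktop.srf
            System.out.println(encoded);
        }
    }
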
Configuring Power BI as a Connector#

To configure Power BI as a connector:

  1. Select "+" and then select "Connection" or right-click on a folder in the File Browser and select ""Create New" and then ""Connection". A 'New Connection' tab opens.

  2. Select "Power BI" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the "Client ID", the "Tenant ID", the "Username" and the "Password" and confirm with "Next". The section 'Save' opens.

  4. If needed, enter a description and confirm with "Save". The dialog 'Save Connection' opens.

  5. Select the name and storage location of the connection and confirm with "Save". The Power BI connection is saved.


Tableau Server#

INFO

Spectrum supports Tableau 9 and 10. Other versions might not function as expected and aren't supported.

Configuring Tableau Server as a Connection#

To configure Tableau Server as a connection:

  1. Select "+" and then select "Connection" or right-click on a folder in the File Browser and select ""Create New" and then ""Connection". A 'New Connection' tab opens.

  2. Select "Tableau Server" from the drop-down in section 'Type' and confirm with "Next". The section 'Connection Details' opens.

  3. Enter the Tableau Server connection URL and, if needed, the site ID to access Tableau through a site other than the default site, then set the connection usage to "Export".

    INFO: The connection usage for Tableau can only be set for exports.

  4. Select the authentication mode from the drop-down: "User/ Password" or "Personal Access Token". The fields adapt to the chosen authentication mode.

    If selecting 'User/ Password', enter the username and password.

    If selecting 'Personal Access Token', enter the token name and the token secret.

  5. If needed, enter custom properties, e.g. 'das.splitting.disable-combining=true' to export large Workbooks faster, and click on "Next". The section 'Save' opens.

  6. If needed, enter a description and confirm with "Save". The dialog 'Save Connection' opens.

  7. Select the name and storage location of the connection and confirm with "Save". The Tableau connection is saved.