
Supported External Data Types and Sources#

Spectrum supports the following structured, semi-structured, and unstructured data types, as well as the following sources and connectors, for importing and exporting data.

Supported External Connectors#

Connectors for Web Services#
Supported External Connector Description
Amazon EC2 a secure and resizable compute capacity to support virtually any workload
Atlassian Jira an issue and project tracking software from Atlassian; data is imported from Jira via JQL (Jira Query Language)
GitHub a software for distributed version management of files
Google AdSense an online service that displays advertising on websites outside of in-house offerings
Google Analytics a tracking tool used for traffic analysis of websites
Google Fusion Tables a web service provided by Google for data management. Fusion Tables can be used to collect, visualize and share data tables
Google Plus Google's former social network
Google Spreadsheet Google's spreadsheet program that is part of the free, web-based Google Docs Editors Suite
Marketo REST - Lead List a marketing automation software for account-based marketing
Marketo SOAP - Lead Activity creation, retrieval and removal of entities and data stored within Marketo
New York Times API search New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia, and other article metadata
Salesforce an enterprise cloud computing firm that specializes in social and mobile cloud technologies, including sales and CRM applications helping companies connect with customers, partners, and employees
Twitter REST API perform searches on Twitter
Zendesk a cloud-based customer support platform
Connectors for Files#
Supported External Connector Description Available for Import Available for Export Notes
Apache Knox WebHDFS is the REST API and application gateway for the Hadoop ecosystem (tick) (tick)
Amazon S3 Amazon's Cloud online storage to store, organize and save data (tick) (tick)
Azure Blob Storage a Microsoft storage service for large unstructured binary and text data. Available for Spectrum users on HDP 2.0+ and CDH 4+ (tick) (tick) For HDP 2.0, 2.2, and CDH 4+ users: contact Spectrum services for the connector plug-in
Custom Protocol can be assigned the same name as a pre-defined protocol, in order to extend the number of IP addresses or ports associated with the original protocol (tick) (tick)
Spectrum Server Filesystem the local Spectrum filesystem (tick) (tick)
FTP (File Transfer Protocol) a standard network protocol used to transfer files from one host to another over a TCP-based network, such as the internet (tick) (error)
HDFS (Hadoop Distributed File System) a distributed file system used by Hadoop applications that creates multiple replicas of data blocks and distributes them on nodes throughout a cluster to allow extremely rapid computations (tick) (tick)
OpenStack Swift cloud storage software for storing and retrieving large amounts of data with a simple API. It is built for scale and optimized for durability, availability, and concurrency across the entire data set, and is ideal for storing unstructured data that can grow without bound (tick) (tick)
SFTP (SSH File Transfer Protocol) transfers files and encrypts both commands and data, preventing passwords and sensitive information from being transmitted openly over the network (tick) (tick)
SSH (Secure Shell) a set of Unix utilities, including SCP and SFTP, that use public-key cryptography and encryption to securely transfer files between Unix file systems. INFO: Spectrum supports Bitvise SSH Server/Client for the Windows platform. The root path specified when creating the connection should look something like: /c:/mydata/folder1 (tick) (tick)
MapR FS a clustered file system that supports both very large-scale and high-performance uses (tick) (tick)
Connectors for Cloud Storages#
Supported External Connector Description Available for Import Available for Export Notes
Amazon Redshift - Fast Load a fast exporting method by loading your data into your S3 server and then copying the data to your Redshift database (tick) (tick)
Azure Data Lake Storage Gen 2 a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Data Lake Storage Gen2 is the result of converging the capabilities of the two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1 (tick) (tick)
Google Cloud Storage is a REST file hosting web service for storing and retrieving data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing features (tick) (tick)
S3 Native Amazon's Cloud online storage to store, organize and save data (error) (tick)
Snowflake a comprehensive data platform provided as Software-as-a-Service (SaaS). Enables data storage and analytic solutions (tick) (tick)
Connectors for Databases#
INFO

Relational databases include Oracle, DB2, and MySQL

Supported External Connector Description Available for Import Available for Export Notes
Amazon Athena a query service for running SQL queries against your data (tick) (error)
Amazon Redshift a quick, scalable data warehouse as a service from the cloud (tick) (tick) Either the native Amazon Redshift JDBC 4.1 driver or a PostgreSQL JDBC driver can be used
Azure Cosmos DB a fully managed NoSQL database service (tick) (tick)
Azure Databricks an Apache Spark based analytics service with an interactive workspace (tick) (error)
Azure Synapse an unlimited analytics service that enables flexible data queries, using either on-demand serverless resources or provisioned resources at scale (tick) (tick)
DB2 IBM's relational database management system (tick) (tick)
Greenplum an open-source massively parallel processing (MPP) database (tick) (tick)
HSQL_file a lightweight, 100% Java SQL Database Engine (tick) (tick)
MSSQL a relational database based on structured query language (tick) (tick)
MySQL a relational database based on structured query language (tick) (tick)
Netezza a column-oriented database management system (tick) (tick)
Oracle a relational database management system designed for grid computing, including CLOB support for importing data (tick) (tick)
PostgreSQL an object-relational database management system (ORDBMS) (tick) (tick)
Sybase IQ a column-based, relational database software (tick) (tick)
Teradata Aster a relational database based on structured query language (tick) (tick) Teradata database needs to be configured to support the appropriate character set
Vertica 5.1+ a grid-based and column-oriented analytic database software (tick) (tick)
Other Connectors#
Supported External Connector Description Available for Import Available for Export Notes
Spectrum Spotlight gives organizations fast access and deep visibility into all of their enterprise data assets - whether in the cloud or on-premises - via a single unified self-service platform. With Spectrum Spotlight business teams can discover, access, collaborate and analyze more data for faster, more trusted cloud analytics while eliminating complex data movement and maintaining strong governance (tick) (error)
Google BigQuery is Google's fully managed data warehouse for petabyte analytics (tick) (tick)
HBase an open-source, non-relational distributed database, written in Java, that runs on top of HDFS (tick) (error) To satisfy the classloader requirements, hbase-protocol.jar must be included in Hadoop's classpath and the root Spectrum classpath (/etc/custom-jars) for versions 0.96.1 to 0.98.0. Learn more in the Apache HBase Reference.
Hive Metastore a service that stores metadata related to Apache Hive and other services, in a backend RDBMS, such as MySQL or PostgreSQL (tick) (tick)
Hive (JDBC) an open source data warehouse system for querying and analysing large data sets stored in Hadoop (tick) (tick)
Hive Server2 (JDBC) a service that enables clients to execute queries against Hive. It supports multi-client concurrency and authentication. Provides support for open API clients like JDBC and ODBC (tick) (error)
IMAP & POP3 (Internet Message Access Protocol) IMAP is the internet standard protocol used by email clients to retrieve email messages from a mail server over a TCP/IP connection. POP3 is a client/server protocol in which email is received and held (error) (tick)
Knox Hive Server2 JDBC the security instance when you have a Hive Server2 JDBC instance running (tick) (tick)
Power BI a business analytics service provided by Microsoft. Provides interactive visualizations with self-service business intelligence capabilities (error) (tick)
Tableau Server a visual analytics platform to host and hold all Tableau workbooks, data sources, and more (error) (tick) Requires at least CentOS 7 as the operating system. Requirements for the Hadoop cluster's operating system libraries: GNU C Library (libc6) version >= 2.15; GNU Standard C++ Library v3 (libstdc++6) version >= 6.1.0
Info

Spectrum is able to split large files across multiple mappers enabling parallel data ingestion. Two requirements must be fulfilled for this to be possible.

  • Splitting of the file protocol must be supported. Currently splitting all of the above protocols is supported.
  • Splitting of the compression type must be supported. Currently LZO and Gzip are splittable; zip and Bz2 are not.

See Importing Data for more information.
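
As a rough illustration of the compression requirement, the following minimal sketch classifies a file by its extension using only the codecs named in the note above. The function name and the extension-based lookup are assumptions made for this example; they are not part of Spectrum.

```python
from typing import Optional

# Codec lists copied from the note above. The names below are
# hypothetical and exist only for this illustration.
SPLITTABLE = {"lzo", "gz"}
NOT_SPLITTABLE = {"zip", "bz2"}


def codec_is_splittable(filename: str) -> Optional[bool]:
    """Return True/False for codecs named in the note, else None.

    None means the note makes no statement about this extension
    (for example, an uncompressed .csv file).
    """
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in SPLITTABLE:
        return True
    if ext in NOT_SPLITTABLE:
        return False
    return None


for name in ("events.csv.gz", "events.csv.lzo", "archive.zip", "dump.bz2", "plain.csv"):
    print(name, codec_is_splittable(name))
```

The file protocol requirement is not modeled here because the note states that all listed protocols currently support splitting.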

Supported External Data Types#

You can import or upload individual sheets from a spreadsheet by first converting the file to a .CSV file type.

File Type Description Available for Import Available for Export Notes
Apache log files Record of all incoming requests from the Apache server. Requests are processed to a log file - format of access log is highly configurable. Location and content of the access log are controlled by the CustomLog directive (tick) (error)
Apache Avro Avro file contains data serialized in a compact binary format and schema in JSON format that defines the data types. An Avro file may also store markers if the datasets are too large and need to be split into subsets when processed (tick) (tick) Supports default compression types
COBOL Copybook is a section of code that defines the data structures of COBOL programs (tick) (error)
CSV (comma-delimited text files) Stores tabular data (numbers and text) in plain-text form (a sequence of characters, with no data that has to be interpreted as binary numbers). Consists of any number of records, separated by line breaks of some kind. Each record consists of fields, separated by some other character or string (most commonly a literal comma or tab) (tick) (tick) Supports default compression types
Excel Workbooks is a spreadsheet application by Microsoft (tick) (error)
Fixed Width a text file in which each field occupies a fixed number of characters, so columns are identified by position rather than by a delimiter (tick) (error)
HTML File file contains Hyper Text Markup Language (tick) (error)
IIS Logs (Internet Information Services) IIS is a web server application and set of feature extension modules for use with Microsoft Windows. IIS 7.5 supports HTTP, HTTPS, FTP, FTPS, SMTP, and NNTP (tick) (error)
JSON An unordered collection of 'key:value' pairs, comma-separated and enclosed in curly braces. Keys must be strings and should be distinct from each other (tick) (error)
Key/value pair A set of two linked data items: key and value. Key is the unique identifier for some item of data. Value is the data that is identified (tick) (error)
Log4j log file logging package written in Java (tick) (error)
MBOX a generic term for a family of related file formats used for holding collections of electronic mail messages (tick) (error)
Netfilter/ IP Tables Netfilter is the packet filtering framework inside the Linux kernel. IP tables is a user space application that allows the system administrator to configure tables provided by the Linux kernel firewall (tick)
ORC (Optimized Row Columnar) a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem (tick) (error) Only supported in conjunction with Hive. Supports default compression types. Export supported only for existing partitioned Hive tables
Parquet a columnar storage format available to any project in the Apache Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language (tick) (tick) External Parquet documentation. Supports default compression types
RCFile (Record Columnar File) a data placement structure that determines how to store relational tables on computer clusters (tick) (error) Export supported in conjunction with Hive only for existing partitioned Hive tables
Regex Parsable Text Files specify the file or folder, enter a Regex pattern for processing the data, and specify whether the first row contains the column headers (tick) (error)
Sequence File with Metadata a flat file consisting of binary key/value pairs. It is used extensively in MapReduce as an input/output format (tick) (error) Supports default compression types
TDE Tableau Data Extract a compressed snapshot of data stored on disk and loaded into memory as required to render a Tableau viz (tick) (tick)
TDSX Tableau Packaged Data Source a zip file that contains the data source file (.tds), as well as any local file data such as extract files (.hyper or .tde), text files, Excel files, Access files, and local cube files (tick) (tick)
Unstructured data Such as Twitter data. Information that either doesn't have a pre-defined data model or doesn't fit well into relational tables. Unstructured information is typically text-heavy, but might contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs as compared to data stored in fielded form in databases or annotated in documents (tick) (error)
XML data specify the file or folder, the root element, container element, and XPath expressions for the fields you would import to Spectrum (tick) (error) Supports default compression types

Supported File Protocols#

Protocol Input Output
File (tick) (tick)
HDFS (tick) (tick)
SSH (SCP and SFTP) (tick) (tick)
S3 (tick) (tick)
INFO

Spectrum supports Bitvise SSH Server/Client for the Windows platform. The root path specified when creating the connection should look something like: /c:/mydata/folder1
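
To make the root path format concrete, here is a minimal sketch that rewrites an absolute Windows path into the shape shown in the example above. The helper name is hypothetical, and the lowercase drive letter is an assumption taken from that single example rather than a documented rule.

```python
from pathlib import PureWindowsPath


def to_ssh_root_path(windows_path: str) -> str:
    """Rewrite e.g. C:\\mydata\\folder1 in the /c:/mydata/folder1 style.

    Hypothetical helper: the target format is inferred from the one
    example in this document, not from a specification.
    """
    p = PureWindowsPath(windows_path)
    drive = p.drive  # "C:" for a drive-letter path
    if len(drive) != 2 or drive[1] != ":" or not p.root:
        raise ValueError(f"expected an absolute drive-letter path: {windows_path!r}")
    # p.parts[0] is the anchor ("C:\\"); the rest are the folder names.
    folders = "/".join(p.parts[1:])
    return f"/{drive.lower()}/{folders}"


print(to_ssh_root_path(r"C:\mydata\folder1"))
```

UNC paths and drive-relative paths are rejected because the example gives no indication of how they would be written.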

Supported file compression codecs#

Codec Input Output Default compression Notes
.gz (tick) (error)
.bz2 (tick) (error)
.lzo (tick) (tick) (tick) Additional native libraries are required.
Snappy (tick) (error) (tick)
.zip (tick) (tick) (tick) Supported file types: CSV, JSON, and XML. Size: no size limitation from Spectrum. Number of files: zip files must contain exactly 1 file. Structure: the file has to be directly in the root of the zip file
.Z (tick) (error)

* LZO: tested with Fedora lzo.i386 v2.02-3.fc8 and http://github.com/kevinweil/hadoop-lzo (state from 2010-Jun-20)
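
The restrictions listed for .zip in the codec table can be checked before an upload. The following is a minimal sketch using Python's standard zipfile module; the function name and the problem messages are hypothetical, and the check reflects only the restrictions stated in the table (one file, located in the archive root, of type CSV, JSON, or XML).

```python
import io
import os
import zipfile
from typing import List

# File types listed for .zip in the codec table.
ALLOWED_EXTENSIONS = {".csv", ".json", ".xml"}


def check_zip_for_import(source) -> List[str]:
    """Return a list of problems; an empty list means no rule is violated.

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(source) as zf:
        files = [info for info in zf.infolist() if not info.is_dir()]
    if len(files) != 1:
        problems.append(f"expected exactly 1 file, found {len(files)}")
    for info in files:
        if "/" in info.filename:
            problems.append(f"{info.filename}: not in the root of the archive")
        ext = os.path.splitext(info.filename)[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{info.filename}: unsupported file type")
    return problems


# Build a small in-memory archive to try the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "id,name\n1,a\n")
buf.seek(0)
print(check_zip_for_import(buf))
```

Returning a list of problems rather than a single boolean makes it easier to report every violated restriction at once.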

Supported File Systems#

  1. Customer - appistry

    1. File system - storage:/
    2. Special Hadoop configuration:
    fs.storage.impl=org.apache.hadoop.fs.appistry.FabricStorageFileSystem
    fs.abs.impl=org.apache.hadoop.fs.appistry.BlockedFabricStorageFileSystem
    fs.appistry.storage.host=localhost
    fs.appistry.chunked=false
    fs.appistry.jetty.port=8085