Importing Data from a Database#

Requirements#

INFO

To import data from a database, the corresponding database driver must be installed in the 'Administration' section, and a connection to the database must be established.

Creating a Database Import Job#

To import data from a database:

  1. Click the "+" button and select Import Job or right-click in the File Browser and select "Create new" > "Import job".
  2. Click "Select Connection", choose your database connection, and click "Select". Then click "Next".
  3. Select a table or view, or enter a SQL statement.

    INFO: The drop-down box for tables and views has a hard limit of 1,000 entries. The schema selection option is available to filter table names for the following database types:

    • MSSQL
    • Oracle
    • PostgreSQL
    • PostgreSQL82/Greenplum
    • Netezza

    INFO: For Azure Cosmos DB imports, disable the 'Character Encoding' option.

  4. View a sample of the data set to confirm that this is the data source you want to use. Mark the checkboxes to select which fields to import into Spectrum. You can also specify the format for date fields; click the question mark help link to see a complete list of supported formats. You can specify the data type using the list box.

    To enable parallel loading of the table, Spectrum uses the chosen split column to segment rows into unique subsets. Good candidates for the split column are primary keys, auto-increment columns, or uniquely indexed columns; the column type should be a number or a date. If no split column is defined, the import uses a single SELECT statement, even if the mapper limit is configured to a higher number (see the sketch after the log excerpts below).

    You can see the difference in behavior within the job log.

    Single split

    INFO ... (JdbcSplitter.java:70) - number of desired splits: 4
    INFO ... (JdbcConnector.java:150) - connected to '<connection_string>' with schema set to 'null'
    WARN ... (DataDrivenSplitStrategy.java:138) - creating single split because splitColumn is set to '$NO_APPROPRIATE_ORDER_COLUMN$'
    INFO ... (JdbcSplitter.java:104) - 1 JdbcSplits:
    INFO ... (JdbcSplitter.java:106) - SELECT {"id", "bytes_processed", ...} FROM ...
    

    After you define an appropriate split column, the import job is processed in parallel.

    Multiple splits

    INFO ... (JdbcSplitter.java:70) - number of desired splits: 4
    INFO ... (JdbcConnector.java:150) - connected to '<connection_string>' with schema set to 'null'
    INFO ... (JdbcConnector.java:356) - SELECT (SELECT MIN("id") FROM "dap_file"."id") AS MIN_VALUE, (SELECT MAX("id") FROM "dap_file"."id") AS MAX_VALUE FROM DUAL
    INFO ... (JdbcSplitter.java:104) - 4 JdbcSplits:
    INFO ... (JdbcSplitter.java:106) - SELECT {"id", "bytes_processed", ...} FROM "dap_file" WHERE {...} ...
    INFO ... (JdbcSplitter.java:106) - SELECT {"id", "bytes_processed", ...} FROM "dap_file" WHERE {...} ...
    INFO ... (JdbcSplitter.java:106) - SELECT {"id", "bytes_processed", ...} FROM "dap_file" WHERE {...} ...
    INFO ... (JdbcSplitter.java:106) - SELECT {"id", "bytes_processed", ...} FROM "dap_file" WHERE {...} ...
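
    As a sketch of what this data-driven splitting produces: suppose the split column "id" ranges from 1 to 100 (as determined by the MIN/MAX query above) and four splits are requested. The importer then issues four range-bounded queries along these lines (the boundaries and column list here are illustrative, not Spectrum's literal output):

    -- Illustrative only: four non-overlapping range predicates over "id"
    SELECT "id", "bytes_processed" FROM "dap_file" WHERE "id" >= 1  AND "id" < 26;
    SELECT "id", "bytes_processed" FROM "dap_file" WHERE "id" >= 26 AND "id" < 51;
    SELECT "id", "bytes_processed" FROM "dap_file" WHERE "id" >= 51 AND "id" < 76;
    SELECT "id", "bytes_processed" FROM "dap_file" WHERE "id" >= 76 AND "id" <= 100;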
    
  5. Define the schedule details, specify whether to replace or append data, and click "Next".

    INFO: With a database import job, under the heading Data Retention Policy, if you select append (with or without the time window), you have the option to enable incremental mode.

    INFO: Incremental mode only imports rows that contain values in the split column greater than the maximum value from the previous import run.
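
    Conceptually, an incremental run behaves like the following query, where <last_max> stands for the highest split-column value seen in the previous run (a sketch of the behavior, not Spectrum's literal SQL; the table and columns are taken from the log example above):

    -- Sketch: fetch only rows beyond the previous run's high-water mark
    SELECT "id", "bytes_processed"
    FROM "dap_file"
    WHERE "id" > <last_max>;  -- <last_max> = MAX("id") from the previous import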

  6. Add a description and, if desired, click the checkbox to start the import immediately. Click "Save" when finished.

    INFO: You can also specify notification emails to be sent when errors occur and when a job has run successfully.

  7. Give the new import job a name and then click "Save". The import job data is now accessible from the File Browser.

Adding a SQL Statement as a Data Source#

Spectrum allows you to enter a complete custom SQL query directly as a data source.

To add a SQL query as a data source:

  1. On the Data Details page, select "Enter SQL statement".
  2. Add your SQL query to the field and click "Next".

    TIP: If the custom SELECT query contains column aliases specified using single quotation marks, the alias is used as the column name instead of the original name. For better performance, run custom SQL queries in parallel by adding a split column to the WHERE clause of the SQL statement.

    For example, if you enter the following:

    SELECT abs AS 'absolute', v1 AS 'valueOne' FROM 'schema.table' WHERE $id_col$ and field2 = 34;
    

    Column names are:

    absolute, valueOne
    

    The split column in the above SQL example is named id_col; wrap it in $ marks. At import time, the placeholder is replaced with a split predicate. Use normal SQL syntax to add any additional criteria to the WHERE clause.
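
    For illustration, assuming boundaries of 1 and 2500 for one of the splits (hypothetical values), the generated query for that split could look like this:

    SELECT abs AS 'absolute', v1 AS 'valueOne' FROM 'schema.table' WHERE (id_col >= 1 AND id_col < 2500) AND field2 = 34;

    More advanced queries are also possible, for example: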

    Spectrum Job History

    SELECT dap_job_configuration.id JobConfID,
           CASE dap_job_execution.job_status
           WHEN 0 THEN 'QUEUED'
           WHEN 1 THEN 'RUNNING'
           WHEN 2 THEN 'COMPLETED'
           WHEN 3 THEN 'ERROR'
           WHEN 4 THEN 'COMPLETED_WITH_WARNINGS'
           WHEN 5 THEN 'WAITING_FOR_OTHER_JOB'
           WHEN 6 THEN 'CANCELED'
           WHEN 7 THEN 'ABORTING'
           WHEN 8 THEN 'WAITING'
           ELSE 'OTHER' END JobConfStatus,
           dap_job_execution.id,
           dap_job_execution.start_time,
           dap_job_execution.stop_time,
           dap_file.name,
           dap_job_execution.user,
           TIMEDIFF(dap_job_execution.stop_time, dap_job_execution.start_time) Duration
    FROM dap_job_configuration
    INNER JOIN dap_file ON dap_job_configuration.dap_file__id = dap_file.id
    INNER JOIN dap_job_execution ON dap_job_configuration.id = dap_job_execution.dap_job_configuration__id
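
    This query joins dap_job_configuration to dap_file and dap_job_execution, maps the numeric job_status codes to readable labels, and uses TIMEDIFF (available in MySQL) to compute each run's duration.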
    
  3. Follow steps 4-7 above to complete the new import job.

INFO

Allowing free-form SQL input is a security risk and can lead to data corruption. Spectrum has a basic check system to help prevent data corruption, but we strongly recommend that the account be READ ONLY and have access only to the required tables/views unless the user has professional experience with SQL statements.

INFO

Spectrum internally sets a LIMIT, which can cause an error if you create a MySQL query that also uses LIMIT. To use LIMIT to select a specific range of results, add parentheses around your query. Example: '(SELECT * FROM table LIMIT 0,10)'
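
For example, assuming Spectrum appends its internal LIMIT (shown here as LIMIT 1000, an assumed value) to the statement it sends to MySQL, the parentheses keep the two LIMIT clauses from colliding:

    -- Without parentheses, the appended clause would yield invalid SQL:
    --   SELECT * FROM table LIMIT 0,10 LIMIT 1000
    -- With parentheses, the combined statement stays valid:
    (SELECT * FROM table LIMIT 0,10) LIMIT 1000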