Workbook Tips#
Here you can find tips and recommendations for optimizing a Workbook.
Optimizing Workbooks#
Avoiding Join of Dates#
We recommend the following: do not join sheets on 'date' columns, and do not perform Cartesian joins.
Avoiding Large Amount of Columns Per Sheet#
We recommend keeping no more columns in a sheet than you actually need.
Configuration of Data Retention Policy#
We recommend configuring the 'Data Retention Policy' section when setting up a data import. It controls how the Workbook data is stored and minimizes the data footprint.
Breaking up the Workflow When Having a High Number of Sheets and Joins#
For better execution performance, break up your workflow into multiple Workbooks when:
- a Workbook contains more than 20 sheets
- a Workbook contains two or more joins
Avoiding Unexpected Large Parquet Files#
Kept sheets in a Spectrum Workbook that contain large strings (e.g., 200 MB of JSON) result in large Parquet files.
To avoid large Parquet files, the following options apply:
- restructure the Workbook chain into a multi-layer data transformation in which only the required data is defined as a kept/result sheet
- increase the JVM heap space for the Spectrum job
Deletion of Non-kept Sheets#
Workbook sheets that are not kept and not used by a downstream Workbook or an export job reduce a Workbook's processing time and storage costs. You should save the Workbook results only for the sheets that are necessary for your downstream process.
To avoid long processing times, an indicator shows which sheets are consumed downstream and therefore need to be kept, while unused non-kept sheets can be deleted:
- find the 'Consumer' indicator on the Workbook Settings page and on the Workbook Details page
- the 'Consumer' indicator helps admins and users identify which portions of older work no longer need to be kept
- delete non-kept sheets that are no longer used
Filtering Data#
We recommend filtering your data as early as possible in your data editing process.
Performance of SQL Sheets#
For performance reasons, place each subquery in its own SQL sheet when working with SQL sheets.
Usage of Descriptions#
Fill in the descriptions for your data transformations. These annotations help you follow your transformation process.
Usage of Partitions#
When importing data of type 'date', always use data partitions.
Workbook Tips for Excel Users#
Spectrum Workbooks have some critical differences from Excel. When you work with data in the Workbook view, you are working with a subset of the data to set up your analysis. Once you have set up your analysis, the job is run on the entire dataset on the cluster. Depending on the size of your dataset, this could take a while.
Other important things to note:
- All commands in Spectrum operate only on columns, not on groups of cells.
- The equivalent of an Excel VLOOKUP is a join in Spectrum.
- Some Spectrum formulas are slightly different from those in Excel.
Other differences between Spectrum and Excel:
- There isn't a right-click menu available for individual cells in a Spectrum Workbook.
- Numbers align differently in Spectrum than Excel.
There are some differences in how sorting and filtering work in Spectrum.
- You can't do custom sorting or case-sensitive sorting in Spectrum. (Note: You can use functions to handle case-sensitive data.)
- Sorting and filtering are separate in Spectrum.
- When creating a filter, the results are shown on the same page.
- Filtering is done using a dialog box rather than using drop down arrows in each of the column headers.
Choosing Custom Properties#
Custom properties allow you to configure how the job runs. For example, you can set how many other jobs can run at the same time as this one. This lets you minimize the traffic on the machine and allows the job to run faster.
Example
das.job.concurrent-mr-job=<numerical value>
das.job.concurrent-mr-job=6
In this example, the number of concurrent jobs allowed to run while this job is running has been set to 6.
Useful How-To Articles#
How to Parse a FileUpload and Analyze Using Regex#
Goal#
Do you have data you would like to bring into Spectrum, but maybe the connector does not come standard? Are you interested in using advanced regex to build an analysis? In this example, you will upload a file using the Regex Parsable Text option and build an analysis using advanced regex.
Preparation#
Download the demo data: cloudera-server.log
Learn#
Loading the Example File#
- Start a "FileUpload" into your Spectrum instance so you can work with the demo data. The demo data you will use is a log file from Cloudera.
- Navigate to the file and select "Regex Parsable Text" as your data type. Choosing the Regex Parsable Text file type allows you to parse the file using your own regex.
- Click "Next".
- Now you need to parse the file. To do this, use the following regex expression. (The regex parse pattern needs to be precise. Make sure you use the exact parse pattern below, including the spaces.)
(\d{4}-\d{2}-\d{2} \d+:\d+:\d+,\d+) (\S+) (\[.*\]) (.*)
- Now that you have the pattern, let's break it down. Anything between ( ) represents a column. (A Python sketch at the end of these steps verifies the full pattern.)
First you have your timestamp: 2014-03-25 18:19:36,311
(\d{4}-\d{2}-\d{2} \d+:\d+:\d+,\d+)
INFO: The \d represents a digit, so the first part (\d{4}- tells Spectrum to look for 4 digits up to a -. Therefore, you use a combination of 4 digits-2 digits-2 digits to get the date.
Next, you break down the time with the second half. The \d+: looks for one or more digits before the ":", and so on.
Now you have your next column, the log level: INFO
(\S+)
INFO: This simply looks for a string of non-whitespace characters (\S) with 1 or more characters (+).
Next, you have your thread column: [Parcel Update Service:components.LocalParcelManagerImpl@105]
(\[.*\])
INFO: This part of the regex looks for content inside [ ], with 0 or more characters (*).
Last, you have your message column: Updating parcel state based on local filesystem state and remote repos.
(.*)
INFO: This is just a body of text, so the regex looks for any characters, of any length.
Click "Next" to continue uploading the file. A preview of the parsed data is shown.
- Make sure you change all of the column names. Additionally, change the 'Date' column to the 'DATE' type. Use this date pattern:
yyyy-MM-dd HH:mm:ss,SSS
- Once you have made your changes, click through until you can save your file. Name your upload "ClouderaServerLogs".
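If you want to sanity-check the parse pattern outside Spectrum, here is a minimal Python sketch using a sample line assembled from the fragments above. It assumes the Python format string %Y-%m-%d %H:%M:%S,%f as the equivalent of the yyyy-MM-dd HH:mm:ss,SSS date pattern:

```python
import re
from datetime import datetime

# The exact parse pattern from the upload step; each (...) group becomes a column.
PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2} \d+:\d+:\d+,\d+) (\S+) (\[.*\]) (.*)")

# Sample line assembled from the fragments shown in the breakdown above.
line = ("2014-03-25 18:19:36,311 INFO "
        "[Parcel Update Service:components.LocalParcelManagerImpl@105] "
        "Updating parcel state based on local filesystem state and remote repos.")

match = PATTERN.match(line)
timestamp, level, thread, message = match.groups()

# yyyy-MM-dd HH:mm:ss,SSS corresponds to %Y-%m-%d %H:%M:%S,%f in Python
# (%f reads the fractional part; 311 is parsed as 311000 microseconds).
parsed_ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S,%f")

print(level)      # INFO
print(thread)     # [Parcel Update Service:components.LocalParcelManagerImpl@105]
print(parsed_ts)  # 2014-03-25 18:19:36.311000
```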
Cleaning the Logs in a Workbook#
- Add the data to a Workbook after the file upload is complete.
- Now you can clean up the data using some more regex. Right-click your source sheet tab at the bottom of the Workbook, select the option to create a duplicate worksheet, and duplicate all columns.
- In the duplicated sheet, select the REGEXTRACT function. The string will be your Thread column and the REGEX will be ^(\S+):
- Click ""
How to Extract Text and Hashtags From Twitter Tweets#
Goal#
Extract hashtags from tweets and filter for the most frequently mentioned hashtags on Twitter.
Preparation#
First, download the Twitter sample data: DatameerTweets.csv
INFO: This sample data was created by importing data via Twitter. We set up a connection using the Twitter REST API connector and configured an import job to look for tweets that contain the word "Datameer".
Learn#
Creating the Workbook#
- Upload the .csv file as a File Upload.
- Open the sample data you downloaded in a new Workbook.
- We are looking for tweets containing hashtags (#). The first step is to clean up the data and remove any tweets that do not contain hashtags.
- Click on the "Filter" button in the toolbar. We will create a simple filter that filters data from the text columns that contain "#" characters.
- Select the column, use the "contains" expression, and filter by the "#" sign.
- Check the box at the top to create this filter in a new sheet.
- Name your new filter sheet "TweetsContainHashtags".
Writing Regex#
Now that we have filtered down our data, we can work on extracting the hashtags with our functions.
- Create a new sheet by clicking the "+" tab at the bottom.
- Copy the tweet text into column A by selecting the Fx bar and adding the formula:
COPY(#TweetsContainHashtags!text)
- Click in the next column to bring up the function wizard. Select the function REGEXTRACT.
- The string we want to use is the new text column we just copied, and the regular expression will be #\w*. After clicking "OK", the resulting formula is shown in the formula bar at the top.
An explanation of this regular expression (see the Python sketch after these steps):
The # is where the regex begins extracting the hashtag. The \w represents a "word character", which is anything A-Z, a-z, or 0-9. The * at the end of the expression looks for 0 or more word characters after the #.
- Rename this worksheet "HashtagExtract". Name the column with the extracted tags "Tokens".
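To see what the #\w* pattern captures, here is a minimal Python sketch with a hypothetical tweet text. Note that re.findall returns every match, while the REGEXTRACT step above pulls one token per row:

```python
import re

# Hypothetical tweet text; real values come from the copied text column.
tweet = "Loving the new #Datameer release, great work #BigData team!"

# #\w* : a '#' followed by 0 or more word characters, as in the REGEXTRACT step.
hashtags = re.findall(r"#\w*", tweet)
print(hashtags)  # ['#Datameer', '#BigData']
```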
Finding the Top Hashtags#
Now we will determine which hashtags are being used most in our sample data.
- Create a new Worksheet.
- Use the GROUPBY function on the Tokens column from your 'HashtagExtract' sheet:
ColumnA: Group Hashtags
Example: GROUPBY(#HashtagExtract!Tokens)
- Click in the next column and use the GROUPCOUNT function to count the occurrences of the hashtags in the data set:
ColumnB: Count Occurrences
- The last step is to find the top 10 most frequent hashtags. Sort the data by clicking the "Sort" button in the toolbar and sort by the count in the GROUPCOUNT column (descending). This last step shows you the most frequent hashtags used in your Twitter tweet texts.
- Run the Workbook to get the full results.
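Conceptually, the GROUPBY + GROUPCOUNT + sort combination behaves like a frequency count. A minimal Python analogy, assuming a hypothetical list of extracted tokens:

```python
from collections import Counter

# Hypothetical tokens; in Spectrum these come from the Tokens column.
tokens = ["#Datameer", "#BigData", "#Datameer", "#Analytics", "#Datameer", "#BigData"]

# GROUPBY + GROUPCOUNT + descending sort, rolled into one Counter call.
for tag, count in Counter(tokens).most_common(10):
    print(tag, count)
# #Datameer 3
# #BigData 2
# #Analytics 1
```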
How to Find Outliers Using Complete Standard Deviation Analytics#
Goal#
Once you have decided on a focus point in your data, finding outliers is a simple and repeatable process in Spectrum. This how-to shows you how to find outliers in your data using Spectrum functions, including standard deviation, and the filtering tool.
Preparation#
Download the demo data: CreditCardData.csv
Learn#
Setting the Baseline#
- Upload the file into Spectrum.
- Open a new Workbook and add the sample data.
- Create a new sheet in your Workbook and name it 'Baseline'.
- Because we are looking for outliers by category, we will create our first GROUPBY on the VendorCategoryCode data from our source:
GROUPBY(#CreditCardData!VendorCategoryCode)
- Finding the average is a vital step in determining a baseline; here is your next column:
GROUPAVERAGE(#CreditCardData!AmountUSD)
- Finding the standard deviation will play a key role in our definition of an outlier:
GROUPSTDEVP(#CreditCardData!AmountUSD)
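For intuition, the Baseline sheet behaves like a per-category aggregation. Here is a minimal Python sketch with hypothetical rows standing in for the CreditCardData sheet, assuming GROUPSTDEVP is the population standard deviation (the 'P'):

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (category, amount) rows standing in for the CreditCardData sheet.
rows = [("GROCERY", 50.0), ("GROCERY", 52.0), ("GROCERY", 48.0),
        ("GROCERY", 51.0), ("GROCERY", 49.0), ("GROCERY", 500.0),
        ("TRAVEL", 400.0), ("TRAVEL", 420.0), ("TRAVEL", 380.0),
        ("TRAVEL", 410.0), ("TRAVEL", 390.0)]

# GROUPBY on the category, then GROUPAVERAGE and GROUPSTDEVP per group
# (pstdev is the population standard deviation, matching the P in GROUPSTDEVP).
groups = defaultdict(list)
for category, amount in rows:
    groups[category].append(amount)

baseline = {cat: (mean(vals), pstdev(vals)) for cat, vals in groups.items()}
for cat, (avg, std) in baseline.items():
    print(cat, round(avg, 2), round(std, 2))
# GROCERY 125.0 167.71
# TRAVEL 400.0 14.14
```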
Comparing Averages#
We want to compare original data with our baseline, so we have to join those sheets in order to reference information from each; here’s how:
- Create a joined sheet by clicking the "Join Sheets" button in the toolbar.
- Perform an inner join based on the VendorCategoryCode from the source sheet and the Category grouping that we created on the new Baseline sheet.
Defining Outliers#
Though there are many ways to do this, including a new sheet with mathematical functions, using advanced filtering keeps your Workbooks clean and efficient. Here's how:
- Create a filter on the join page and use the Advanced Filter setting.
- Open the filter dialog and limit the results based on this simple equation:
- Is the amount minus the average more than twice the standard deviation? If so, it's an outlier. The remaining results represent the instances in which the filter is true!
Full expression:
#CreditCardData!AmountUSD-#Baseline!AverageSpend>2*#Baseline!SpendStandardDeviation
- If you used our data, run the Workbook and look at the results.
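The filter expression translates to a simple per-row comparison. Continuing the hypothetical rows from the baseline sketch above:

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical rows, as in the baseline sketch above.
rows = [("GROCERY", 50.0), ("GROCERY", 52.0), ("GROCERY", 48.0),
        ("GROCERY", 51.0), ("GROCERY", 49.0), ("GROCERY", 500.0),
        ("TRAVEL", 400.0), ("TRAVEL", 420.0), ("TRAVEL", 380.0),
        ("TRAVEL", 410.0), ("TRAVEL", 390.0)]

groups = defaultdict(list)
for category, amount in rows:
    groups[category].append(amount)
baseline = {cat: (mean(vals), pstdev(vals)) for cat, vals in groups.items()}

# Keep rows where amount - average > 2 * standard deviation,
# mirroring the full filter expression above.
outliers = [(cat, amount) for cat, amount in rows
            if amount - baseline[cat][0] > 2 * baseline[cat][1]]
print(outliers)  # [('GROCERY', 500.0)]
```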