Sample Data#
Data sampling is a statistical technique for selecting, processing, and analyzing a representative subset of your dataset in order to identify patterns and trends in the full data. Working with a small, manageable subset lets you build analytical models and run transformations faster than would be possible with the complete dataset.
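As a minimal sketch of the idea (using plain Python on synthetic numbers, not Datameer itself), statistics computed on a random sample closely track the full dataset at a fraction of the processing cost:

```python
import random

random.seed(42)

# Hypothetical dataset of 1,000,000 numeric records.
full_data = [random.gauss(50, 10) for _ in range(1_000_000)]

# Draw a representative random sample of 100,000 records.
sample = random.sample(full_data, 100_000)

# The sample mean closely approximates the full-data mean.
full_mean = sum(full_data) / len(full_data)
sample_mean = sum(sample) / len(sample)
print(f"full mean:   {full_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
```

The larger the sample, the closer its statistics track the full dataset, at the cost of more processing time.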
Overview#
The Workbench indicates whether the full dataset or sample data is used. You find this information below the Flow Area. In addition, an '!' icon next to a dataset node within the Flow Area indicates that sample data is used.
Example for full data:
The dataset contains 25 records with a size of 4.1 kB. All records are displayed and used for creating views.
Example for sample data:
The dataset contains 20,000,000 records in total with a size of 536 MB. 100,000 of these records are used as sample data, on which transformations are calculated and displayed.
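To put the numbers from this example into perspective, the sample covers only a small fraction of the records:

```python
# Figures from the sample-data example above.
total_records = 20_000_000
sample_records = 100_000

# The sample covers only 0.5% of all records.
fraction = sample_records / total_records
print(f"{fraction:.1%}")
```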
After applying an operation to a source dataset for which sample data is displayed, a notification below the Flow Area informs you that the upstream dataset uses only sample data.
Configuring Sample Data Via Custom Property#
For optimized performance and an accurate representation of your data, Datameer provides a default initial sample of 100,000 records when you start your analysis.
Datameer Admins can configure the settings for sample data usage individually. Find more information here.
Note: Increasing these values, and therefore working with larger samples, may lower the processing performance.
Troubleshooting#
Sometimes no records are shown in the data preview after applying an operation because the source dataset is sampled, e.g., when you join two huge datasets and the samples have no matching records.
In this case, a notification is displayed below the Flow Area with a link to this troubleshooting section.
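A small sketch (in plain Python, with made-up table sizes) illustrates why a join on two independently sampled datasets often produces an empty preview even though the full datasets would match:

```python
import random

random.seed(0)

# Two hypothetical large tables sharing a key column with 10,000,000
# distinct values; a join on the FULL data would find many matches.
key_space = range(10_000_000)

# Each side is sampled independently down to 1,000 keys (0.01%).
left_sample = set(random.sample(key_space, 1_000))
right_sample = set(random.sample(key_space, 1_000))

# The chance that any sampled key appears on both sides is tiny
# (expected overlap: 1,000 * 1,000 / 10,000,000 = 0.1 keys),
# so the joined preview is usually empty.
matches = left_sample & right_sample
print(f"matching keys in sample: {len(matches)}")
```

The downstream result is still correct on the full data; only the sampled preview is empty.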
In this case, we recommend the following temporary workaround:
- Perform the needed operation(s) on your dataset.
- If you notice that the data preview is empty, continue performing your operations anyway.
- (Optional) To validate an upstream dataset, publish the node as a table to your Snowflake.
- Perform further operations on the upstream dataset, not on the published table, to keep the pipeline intact.
- After all operations have been performed, publish your view/table to Snowflake as usual.
Note: Since these cases require a temporary workaround, we would appreciate it if you contacted our support.