User Guide

About BryteFlow Blend

BryteFlow Blend allows data stored in S3 to be queried and transformed using SQL as if it were stored in a regular database.

Blend also allows the creation of a pipeline of these SQL operations which can be performed in sequence creating complex chains of transformations. This includes joining other tables, and joining the results of other queries.

The output of any query can be saved back to S3 and subsequently exported to third party systems, for example Redshift or Aurora.

Jobs

Within Blend, a pipeline of operations is contained within a Job. A Job is made up of any number of Tasks, which are the individual steps, or operations; together these are also described as a Task Pipeline.

A Job may contain any number of mutually exclusive pipelines; however, typically a Job contains only one.

Jobs can be scheduled to run at regular intervals and can then be monitored for errors.

Tasks

Each action that can occur in the pipeline of operations is known as a Task. There are four types of task that can be performed:

  • Load – A Load task allows data to be loaded from S3 into Blend. Data loaded using a Load task is then available in the pipeline to be queried.
  • Query – Using regular ANSI SQL2, data from Load tasks, or the results of other Query tasks, can be transformed and queried.
  • Save – A Save task allows the data from a Query task to be saved back to S3.
  • Export – The export task allows the data, once saved, to be pushed to external systems.

Example

A typical job may look something like Diagram 1. This shows a pipeline where data is being loaded, transformed in some way, saved, and then exported.

Diagram 1

However, pipelines can be complex and non-linear. Diagram 2 shows Table B and Table C being loaded, then joined, and then subsequently joined to Table A which has undergone some filtering. There are two save points in this pipeline and an export of the final result.

Diagram 2

Prerequisites

BryteFlow Blend has a number of minimum requirements for operation.

  • Real or virtual machine running Windows 10
  • Java version 7 or higher installed on the machine
  • Connectivity and permissions to access AWS S3
  • Connectivity and permissions to access AWS EMR
  • Port 8080 available on the system for the application to bind to
  • Open access to Port 8080 through any firewalls for interface access
  • Google Chrome to access the application interface

Installation

Step 1 – Application Files

The application can be placed in any folder; however, we recommend installing to the following location:

C:\BryteFlow\Blend\

Unzip the application archive to this folder.

Step 2 – Run the Application

Navigate to the Application Folder and run the start script:

start.bat

Step 3 – Access the User Interface

BryteFlow Blend includes a web based user interface. You can access this, once the application is running, on:

http://localhost:8080/

If you are accessing remotely then use the appropriate domain and port 8080.

 

Overview of User Interface

Job List

The Job List shows a list of Jobs that have been set up within Blend. Job Items within the list show the Job Name, the current Status of the Job, and an icon that indicates whether the Job is enabled.

On the right hand side of the Job Item is a menu icon which provides access to configuration options for that Job.

Clicking a Job from the list will display the Task Pipeline.

Task Pipeline

The Task Pipeline shows all the Tasks that have been created for a specific Job. On a mobile device this is represented as a list of Tasks similar to the Job list, on desktop devices this is represented as a visual graph of the pipeline including the data flow and connections between tasks.

                                                                    

Mobile View and Desktop View

 

Configuration

Connections

The BryteFlow Blend system utilises external systems to provide data and SQL operations. Core to this is AWS S3 for storage of data, and AWS EMR running Spark to provide SQL data manipulation. These two systems are the minimum that are required for Blend operations.

Additional systems can be added to access additional features such as exporting of data, or advanced monitoring.

The Connections view allows the configuration of connections to these external systems. The Connections view can be accessed via the global menu on the left of the title bar.

The Connections view is divided into two sections. The left panel lists the available external systems that can be configured. Clicking on an item in the list will display the properties of the system in the right hand panel.

 

AWS Configuration

BryteFlow Blend keeps pace with most AWS features and makes the best use of them. To get started with setting up Blend, you need to provide your AWS details.

You can do this by going to the ‘Connections’ view, which can be accessed via the global menu on the left of the title bar.

This provides a link to navigate to the Connections page, as shown below:

Once you click ‘Connections’, the Connections page opens with a list of configurations that can be managed from here. The first, and mandatory, configuration is ‘AWS Location’. All you need to provide is your:

  • AWS Region
  • AWS Access Key
  • AWS Secret Access Key

 

Connecting to EMR

Selecting the ‘EMR Instance‘ item in the left hand panel will allow the EMR Instance to be configured. The following fields need to be completed:

  • Name, the name for this instance. This is for your use and identification within the system and has no effect on the operation of the instance.
  • Instance, the unique AWS EMR instance ID. This can be obtained from your AWS Console.
  • Jar Location, the full S3 location of the Blend runtime that will be used by the EMR cluster to process data. Place the supplied JAR file in an accessible S3 location.

Connecting to S3

Selecting the ‘S3 Repository‘ item in the left hand panel will allow AWS S3 access to be configured. The following fields need to be completed:

  • Name, the name for this connection. This is for your use and identification within the system.
  • Schema Type, the source of data structure information for this connection. This can be one of the following types:
    • Struct metadata file
    • Parquet format
    • ORC format
  • Bucket, the S3 bucket where the data will reside.
  • S3 Folder (structure), the folder location for the structure information. This is required for the ‘Struct metadata’ schema type, which reads the metadata from this location.
  • S3 Folder (data), the folder location for the data in S3.

Connecting to Redshift

Amazon Redshift can be used as one of the destinations for transformed S3 data. Data from S3 can be exported to Redshift objects with a simple configuration. Selecting the ‘AWS Redshift‘ item in the left hand panel will allow the Redshift destination to be configured.

The following details need to be added:

  • Name, the name for this destination. This is for your use and identification within the system.
  • Host, the domain host for your Redshift instance.
  • Port, the TCP port for access to your Redshift instance.
  • Database, the database name to use for data operations.
  • Schema, the database schema to use for data operations.
  • IAM Role, the AWS IAM Role to use to access your Redshift instance.
  • User Id, the AWS User Id to use to access your Redshift instance.
  • Password, the password for the AWS User to access your Redshift instance.
  • Append Data, indicates that data should be appended to Redshift as opposed to overwriting.

Connecting to Git

Bryteflow Blend supports connection to a Git repository to provide version control for jobs created on the system. The Git repository supported is AWS CodeCommit.

AWS CodeCommit is a version control service hosted by Amazon Web Services that you can use to privately store and manage assets (such as documents, source code, and binary files) in the cloud. For more information about AWS CodeCommit, refer to AWS documentation at https://aws.amazon.com/codecommit/ 

This connection is optional and is used only if you decide to store all jobs in a version control system.

Selecting the ‘CodeCommitRepository’ item in the left-hand panel will allow Git access to be configured. The following fields need to be completed:

  • Connection Name: The name for this connection. This is to identify this connection within the system.
  • System: The type of connection. The only option to select is CodeCommitRepository.
  • Remote repository path: Enter the path for the remote repository.
  • Git User Name: The user id used to connect to AWS CodeCommit.
  • Git Password: The password used to connect to AWS CodeCommit.

Connecting to Cloudwatch

Bryteflow Blend supports connection to Cloudwatch Logs and Cloudwatch Metrics. This can be used to monitor the operation of Bryteflow Blend and integrate with other assets leveraging the AWS infrastructure.

AWS Cloudwatch Logs can be used to send logs of events like job completion or failure from Bryteflow Blend. Cloudwatch Logs can be used to monitor error conditions and raise alarms.

AWS Cloudwatch Metrics can be used to monitor metrics for Bryteflow Blend like time taken to execute jobs and the time jobs wait for the EMR to become available. These metrics can be monitored on the AWS dashboard and alarms can be set when the metrics cross a predefined threshold.

These connections are optional and are used only if you decide to interface with Cloudwatch Logs and Metrics.

Selecting the ‘Cloudwatch Logs’ item in the left-hand panel will allow access to Logs to be configured. The following fields need to be completed:

  • Connection Name: The name for this connection. This is to identify this connection within the system.
  • Log Group: Name of the Log Group. This has to be configured using the AWS console.
  • Log Stream: Name of the Log Stream. This has to be configured using the AWS console.

Selecting the ‘Cloudwatch Metrics’ item in the left-hand panel will allow access to Metrics to be configured. The following fields need to be completed:

  • Connection Name: The name for this connection. This is to identify this connection within the system.

 No other parameters are required.

Connecting to SNS

Bryteflow Blend supports connection to AWS SNS. This can be used to send SNS messages on significant events like job completion or failure. This can be used to coordinate other tasks on the AWS infrastructure or for generating alarms and notifications.

AWS SNS is a publish/subscribe notification service that can be used to reliably route notifications to multiple endpoints. This can be used to trigger event-driven computing and workflows.

Selecting the ‘SNS’ item in the left-hand panel will allow access to SNS to be configured. The following fields need to be completed:

  • Connection Name: The name for this connection. This is to identify this connection within the system.
  • SNS Topic: Name of the SNS Topic. This has to be configured using the AWS console.

User Login

On opening the web page for Bryteflow Blend, the user login page appears, as shown in the image below.

A single user called admin has already been created. Log in using this user and the supplied password, then click the SIGN IN button.

To add additional users to the application, edit the config/user.txt file. Add one entry per user. A typical line is:

jsmith||ADMIN|E

The fields are:

  • User Id: The user id for logging in.
  • Password: The encrypted password. Leave this blank.
  • Role: This should be ‘USER’ or ‘ADMIN’. A user with ADMIN role will be able to access connections page and set up the configuration, whereas a user with ‘USER’ role will not be able to access connections page.
  • E/D: (enable/disable) E should be used to enable the user and D to disable the user. This field can also be used to temporarily disable a user.
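
For illustration, a hypothetical config/user.txt containing an administrator, a regular user, and a temporarily disabled user (the user ids are examples only) might look like:

jsmith||ADMIN|E
asmith||USER|E
temp.analyst||USER|D

Here temp.analyst remains in the file but is disabled via the final field.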

When a new user is created with an empty password or the password of an existing user is removed, the system prompts for a new password on the next login by that user.

Usage

Adding a Job

A new Job can be added to Blend by clicking the floating Add Job button located in the bottom right of the Job List.

This will open the Add Job Dialog where the name for the Job can be entered, and the schedule for when the Job will be executed can be set.

The new Job will be added to the Job List and is disabled by default.

Editing a Job

A Job can be edited by clicking the menu icon to the right of the Job in the Job List and selecting the Edit option.

This will prompt the User for various actions that can be performed specifically for the Job.

Clicking ‘Edit’ opens the ‘Edit Job’ dialog where the name of the Job, and the schedule for the Job, can be modified.

The scheduling system determines when a Job will be initiated. There are four options:

  • Minutely – Allows you to specify repeating duration in minutes on which the Job will be initiated, for example every 30 minutes,
  • Daily – Allows you to specify a specific time each day when the Job will be initiated,
  • Monthly – Allows you to specify a specific day and time each month when the Job will be initiated.
  • On SNS Trigger – Allows you to trigger the execution of the Job based on an SNS topic configured.

Deleting a Job

The Edit Job Dialog also includes a Delete Job button that will, after confirmation, delete the Job and all associated Tasks.

Adding a Task

Tasks are added to a specific Job. First you must select the Job from the Job list which will then display the Task Pipeline. You can now add a new Task by clicking the floating Add Task button on the lower right of the Task Pipeline area.

This will open the Add Task Dialog where the name of the Task can be entered and the type of Task selected.

Clicking the CREATE Button will close the dialog and add the Task to the Task Pipeline. You may now select the Task to edit its details.

Editing a Task

A task can be edited by clicking on the Task item in the Task Pipeline.

This will open the Edit Task dialog box specific to the type of Task that you are editing.

On desktop devices you are able to position the Task within the Task Pipeline by dragging the Task to a new position. This will automatically redraw any connection lines indicating the data flow.

Load Task

The Load Task allows data to be loaded into the Blend system from AWS S3. When editing a Load task you will need to choose the AWS S3 connection you want to use to load data, and you will need to provide the S3 path to the data folder and the manifest folder.

Query Task

The Query Task allows the data from Load tasks or other Query Tasks to be transformed, filtered, or joined using regular SQL. When editing a Query task you are able to use the editor to directly enter SQL statements.

Blend allows you to reference the data from other Tasks as if they were regular tables using a curly brace notation. For example, if you had a Task named “My Data” you can query the data from that Task with the following SQL:

SELECT *
FROM {My Data}

The Blend tool is compatible with ANSI SQL2 and the editor provides hinting for syntax and built in functions.
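
For example, a hypothetical Query task (the task and column names below are illustrative only) could join the output of two Load tasks and filter the result:

SELECT o.order_id, o.order_date, c.customer_name
FROM {Orders} o
JOIN {Customers} c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2018-01-01'

The result of this Query task can in turn be referenced by later Query, Save or Export tasks in the same pipeline.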

Save Task

The Save Task allows data to be saved from Blend back to AWS S3. When editing a Save task you will need to choose the source Task from which to obtain the data and the AWS S3 connection you want to use to save to. You will also need to provide the S3 path to the data folder and the manifest folder.

Export Task

The Export Task allows saved data to be exported to third-party systems. When editing an Export task you will need to choose the source Task from which to obtain the data and the connection you want to export to. Typically you will be asked for a destination name or table; however, depending on the type of third-party system there may be other fields that need to be completed.

Deleting a Task

The Edit Task dialog for each Task type also includes a DELETE button that will, after confirmation, delete the Task and its associated data connections.

View S3 Files

All files associated with Input and Output tasks can be viewed within Bryteflow Blend.

To open the file view panel, click on the up arrow icon on the lower right corner.

This will open the S3 file view panel.

The top bar of the panel shows tabs for all input and output tasks with an icon next to it indicating the task type. Click on a tab to see the contents of the S3 file. The title bar shows the column names. In some cases, the job may not have yet been run and the output file may not exist. In that case, a message to that effect will appear in the data window.

Click on the full-screen icon to expand the data panel to cover the whole screen.

The expand, collapse, full screen and hide icons are displayed wherever applicable to toggle between the different display modes.

Run Now

Blend Jobs can be instantly triggered with the ‘Run Now’ feature. This is a force-run option which overrides the Job’s status and schedule, allowing a Job to be triggered immediately irrespective of its status or scheduled date and time. Users can run their Jobs whenever they want, without waiting for the next scheduled run.

To invoke Run Now on any job, click on the menu next to the job and select Run Now.

A confirmation dialog will be presented – click Run.

The job will be run next on the EMR instance. The status will be updated on the screen similar to a normal job run.

Run Trace

Bryteflow Blend allows a job to be run for debugging purposes. A small subset of the records in input files is used to reduce the run time of the trace process and unlike a normal run, all intermediate outputs are available.

To invoke a Run Trace on any job, click on the menu next to the job and select Run Trace.

A confirmation dialog will be presented – click Run.

The status of the run will be displayed in the job status.

Once the run is complete, all data for each Input, Query and Output task will be available for viewing. Switch the display to the trace output by clicking on the view mode tab on the left bottom and selecting trace.

Once the trace mode is selected, the S3 file view is switched to the trace outputs. To view the trace outputs, click on the up arrow icon on the lower right corner.

The top bar of the panel shows tabs for all input, query and output tasks with an icon next to it indicating the task type. Click on a tab to see the contents of the trace output for that task. The title bar shows the column names.

Click on the full-screen icon to expand the data panel to cover the whole screen.

Save Job to Git

A job can be saved to Git by clicking on the menu icon to the right of the job in the job list and selecting the Save to Remote Repository option.

This will open the Save to remote Repository dialog where a comment for this version can be entered.

Retrieve Job from Git

A previously saved job can be retrieved from Git. This can be useful to transfer jobs from one instance of Bryteflow Blend to another. This is a recommended practice for releasing jobs from a development environment to a production environment.

To list all jobs in a Git repository, click on the corner menu and select Versions.

This displays a list of jobs stored in the Git repository.

Click on a job to see all versions of that job.

Select one of the versions to retrieve by clicking on the box for that version. This will show the download dialog. Select Download.

If the job does not exist, it will be downloaded and added to the list of jobs. If the job already exists, it will be replaced by the retrieved job.

Monitoring

Job Status

Jobs within Blend can be in a number of states, represented by their status. These status values are:

  • Never Run, the Job has never been executed,
  • Pending, the Job has been queued to EMR however execution has not yet commenced,
  • Running, the Task Pipeline for the Job is being executed,
  • Completed, the execution of the Task Pipeline has completed successfully,
  • Logging, there was an error and the job aborted without completing, logging information is being prepared,
  • Error, there was an error and the job aborted without completing.

The status of a job execution can be viewed in two ways.

Job List status

The current status of a Job is displayed in a secondary line of text beneath the name of the Job in the Job List.

Monitor View

If more detailed information is required, for example in the case of an error, the Monitor View can be accessed via the Job Item menu on the right side of the Job Item.

The monitor view shows the status information in addition to any error/exception information the Job has produced during execution.

Spark SQL in BryteFlow Blend

Bryteflow Blend gives the ability to execute SQL in the form of a Spark SQL Query Task. Spark SQL in Blend is executed as a distributed in-memory computation at massive scale. The SQL contained in a Blend Query Task is compatible with ANSI SQL2 and the editor provides auto-suggestions for syntax and built-in functions.

Query Task

The SQL is contained in a Query Task. A Query Task benefits users by giving them the flexibility to filter, join and transform data from Load tasks or other Query Tasks using regular SQL. The SQL is entered in the Spark SQL Query editor, as shown in the example below:

Query Tasks enable users to reference the data from other Blend Tasks, within the Job, as regular database tables, with the use of curly brace ‘{ }’ notation. For example, if there is a Task “MyData”, it can be queried in another SQL Query Task with the following SQL:

SELECT *
FROM {MyData}
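
A slightly fuller hypothetical example (the column names are illustrative only) applies built-in functions while referencing the same task:

SELECT upper(trim(customer_name)) AS customer_name,
       to_date(order_date) AS order_date,
       round(order_total, 2) AS order_total
FROM {MyData}
WHERE order_total IS NOT NULL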

As Blend uses Spark SQL, which is compatible with ANSI SQL2 standards, it can leverage existing developer skillsets through a standard set of SQL functions and properties. The primary categories are as follows:

Column Functions and Properties

Below are the column functions and properties supported in Spark SQL:

Signature Return Type Description
ascii(STRING str) INT Returns the numeric value of the first character of str.
base64(binary bin) STRING Converts the argument from binary to a base 64 string
bin(bigint a) STRING Returns the number in binary format
decode(binary bin, string charset) STRING Decodes the first argument into a String using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null.
encode(string src, string charset) BINARY Encodes the first argument into a BINARY using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null.
expr(string expr) COLUMN Parses the expression string into the column that it represents.
format_number(number x, int d) STRING Formats the number X to a format like ‘#,###,###.##’, rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part.
format_string(String Format, Arguments) STRING Formats the arguments in printf-style and returns the result as a string column.
get_json_object(string json_string, string path) STRING Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid. NOTE: The json path can only have the characters [0-9a-z_], i.e., no upper-case or special characters. Also, the keys *cannot start with a number.*
initcap(string A) STRING Returns string, with the first letter of each word in uppercase, all other letters in lowercase. Words are delimited by whitespace.
instr(string str, string substr) INT Returns the position of the first occurrence of ‘substr’ in ‘str’. Returns null if either of the arguments are null and returns 0 if ‘substr’ could not be found in ‘str’. Be aware that this is not zero based. The first character in string has index 1.
isnan(string str) BOOLEAN Returns true if the input is Not a Number.
isnull( a ) BOOLEAN Returns true if a is NULL and false otherwise.
length(string A) INT Returns the length of the string.
levenshtein(string A, string B) INT Returns the Levenshtein distance between two strings. For example, levenshtein(‘kitten’, ‘sitting’) results in 3.
locate(string substr, string str[, int pos]) INT Returns the position of the first occurrence of ‘substr’ in str after position pos.
lower(string A) lcase(string A) STRING Returns the string resulting from converting all characters of A to lower case. For example, lower(‘sAmple’) results in ‘sample’.
lpad(string str, int len, string pad) STRING Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.
ltrim(string A) STRING Returns the string resulting from trimming spaces from the beginning (left hand side) of A. For example, ltrim(‘ sample ‘) results in ‘sample ‘.
nanvl(number col1, number col2) NUMBER Returns col1 if it is not NaN, or col2 otherwise.
repeat(string str, int n) STRING Repeats str n times; if str is a column name, each value in that column is repeated n times.
regexp_extract(string subject, string pattern, int index) STRING Returns the string extracted using the pattern. For example, regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 2) returns ‘bar.’
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT) STRING Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace(“sample”, “am|le”, “”) returns ‘sp.’ Note that some care is necessary in using predefined character classes: using ‘\s’ as the second argument will match the letter s; ‘\\s’ is necessary to match whitespace, etc.
reverse(string A) STRING Returns the reversed string.
rpad(string str, int len, string pad) STRING Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of an empty pad string, the return value is null.
rtrim(string A) STRING Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(‘ sample ‘) results in ‘ sample’.
split(string str, string chr) ARRAY Splits str around ‘chr’ (is a regular expression)
substring_index(string A, string delim, int count) STRING Returns the substring from string A before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. Substring_index performs a case-sensitive match when searching for delim.
substr(string|binary A, int start, int len) STRING Returns the substring or slice of the byte array of A starting from start position with length len. For example, substr(‘sample’, 4, 1) results in ‘p’.
substring(string|binary A, int start, int len) STRING Returns the substring or slice of the byte array of A starting from start position with length len. For example, substring(‘sample’, 4, 1) results in ‘p’.
translate(string src, string matchingString, string replaceString) STRING Translates any character in src that appears in matchingString into the corresponding character in replaceString. The characters in replaceString correspond, position by position, to the characters in matchingString.
trim(string A) STRING Returns the string resulting from trimming spaces from both ends of A. For example, trim(‘ sample ‘) results in ‘sample’
unbase64(string str) BINARY Converts the argument from a base 64 string to BINARY.
upper(string A) STRING Returns the string resulting from converting all characters of A to upper case. For example, upper(‘saMple’) results in ‘SAMPLE’.
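
As a sketch of how these string functions compose (the task and column names are illustrative assumptions, not part of the product), a Query task could standardise free-text fields like this:

SELECT initcap(trim(first_name)) AS first_name,
       lpad(postcode, 4, '0') AS postcode,
       regexp_replace(phone, '[^0-9]', '') AS phone_digits,
       length(notes) AS notes_length
FROM {Contacts}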

Conditional Functions and Properties

Below are the functions supported in Spark SQL for Conditional operations :

Signature Return Type Description
coalesce(Column v1,Column v2, …)   Returns the first column’s value that is not NULL, or NULL if all v’s are NULL. This function can take any number of arguments.
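
For instance, a hypothetical query (the column names are illustrative) could use coalesce to pick the first available contact number, falling back to a literal:

SELECT customer_id,
       coalesce(mobile_phone, home_phone, 'unknown') AS contact_number
FROM {Customers}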

Date Functions and Properties

Below are the Date Time functions supported in Spark SQL :

Signature Return Type Description
add_months(string start_date, int num_months) STRING Returns the date that is num_months after start_date . start_date is a string, date or timestamp. num_months is an integer. The time part of start_date is ignored. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date
current_date DATE Returns the current date.
current_timestamp TIMESTAMP Returns the current timestamp
date_add(date startdate, tinyint/smallint/int days) DATE Adds a number of days to startdate: date_add(‘2008-12-31’, 1) = ‘2009-01-01’.
date_sub(date startdate, tinyint/smallint/int days) DATE Subtracts a number of days from startdate: date_sub(‘2008-12-31’, 1) = ‘2008-12-30’.
date_trunc(format: String, timestamp: Column) DATE Returns timestamp truncated to the unit specified by the format.
date_format(dateExpr: Column, format: String) STRING Converts a date/timestamp/string to a string value in the format specified by the date format given by the second argument.
datediff(string enddate, string startdate) INT Returns the number of days from startdate to enddate: datediff(‘2009-03-01’, ‘2009-02-27’) = 2.
dayofmonth(date) INT Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(date) INT Extracts the day of the year as an integer from a given date/timestamp/string.
dayofweek(date) INT Extracts the day of the week as an integer from a given date/timestamp/string
window(timeColumn: Column, windowDuration: String) STRING Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
trunc(date: Column, format: String) STRING Returns date truncated to the unit specified by the format.
from_unixtime(bigint unixtime[, string format]) STRING Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of “1970-01-01 00:00:00”.
from_utc_timestamp({any primitive type}*, string timezone) TIMESTAMP Assumes given timestamp is UTC and converts to given timezone. For example, from_utc_timestamp(‘1970-01-01 08:00:00′,’PST’) returns 1970-01-01 00:00:00.
hour(string date) INT Returns the hour of the timestamp: hour(‘2009-07-30 12:58:59′) = 12, hour(’12:58:59’) = 12.
last_day(string date) STRING Returns the last day of the month which the date belongs to. date is a string in the format ‘yyyy-MM-dd HH:mm:ss’ or ‘yyyy-MM-dd’. The time part of date is ignored.
minute(string date) INT Returns the minute of the timestamp.
month(string date) INT Returns the month part of a date or a timestamp string: month(“1970-11-01 00:00:00”) = 11, month(“1970-11-01”) = 11.
months_between(date1, date2) DOUBLE Returns number of months between dates date1 and date2. If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components date1 and date2. date1 and date2 type can be date, timestamp or string in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH:mm:ss’. The result is rounded to 8 decimal places. Example: months_between(‘1997-02-28 10:30:00’, ‘1996-10-30’) = 3.94959677
next_day(string start_date, string day_of_week) STRING Returns the first date which is later than start_date and named as day_of_week . start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day(‘2015-01-14’, ‘TU’) = 2015-01-20.
quarter(date/timestamp/string) INT Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4. Example: quarter(‘2015-04-08’) = 2.
second(string date) INT Returns the second of the timestamp.
to_date(string date) DATE Converts a string formatted like yyyy-MM-dd into a Date object.
to_utc_timestamp({any primitive type}*, string timezone) TIMESTAMP Assumes given timestamp is in given timezone and converts to UTC. For example, to_utc_timestamp(‘1970-01-01 00:00:00′,’PST’) returns 1970-01-01 08:00:00.
trunc(string date, string format) STRING Returns date truncated to the unit specified by the format. Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. Example: trunc(‘2015-03-17’, ‘MM’) = 2015-03-01.
unix_timestamp() BIGINT Gets current Unix timestamp in seconds. This function is not deterministic and its value is not fixed for the scope of a query execution, therefore prevents proper optimization of queries – this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant.
weekofyear(string date) INT Returns the week number of a timestamp string: weekofyear(“1970-11-01 00:00:00”) = 44, weekofyear(“1970-11-01”) = 44.
year(string date) INT Returns the year part of a date or a timestamp string: year(“1970-01-01 00:00:00”) = 1970, year(“1970-01-01”) = 1970.
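
As an illustration (the table and column names are assumptions for the example), several date functions can be combined in a single Query task:

SELECT order_id,
       to_date(order_date) AS order_day,
       date_add(to_date(order_date), 30) AS due_day,
       datediff(current_date, to_date(order_date)) AS age_in_days,
       last_day(order_date) AS month_end
FROM {Orders}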

Mathematics Functions and Properties

Below are the Mathematical functions supported in Spark SQL :

Signature Return Type Description
abs(INT a) INT Returns the absolute value.
acos(INT a) DOUBLE Returns the arccosine of a if -1<=a<=1 or NULL otherwise.
asin(INT a) DOUBLE Returns the arc sin of a if -1<=a<=1 or NULL otherwise.
atan(INT a) DOUBLE Returns the arctangent of a.
bitwiseNOT(DOUBLE a) DOUBLE Computes bitwise NOT of the given integer.
cbrt(DOUBLE a) DOUBLE Returns the cube root of a double value
ceil(DOUBLE a) BIGINT Returns the minimum BIGINT value that is equal to or greater than a.
conv(BIGINT num, INT from_base, INT to_base) STRING Converts a number from a given base to another
conv(STRING num, INT from_base, INT to_base) STRING Converts a number from a given base to another
cos(DOUBLE a), cos(DECIMAL a) DOUBLE Returns the cosine of a (a is in radians).
cosh(DOUBLE a), cosh(DECIMAL a) DOUBLE Computes the hyperbolic cosine of the given value.
exp(DOUBLE a), exp(DECIMAL a) DOUBLE Computes the exponential of the given value.
expm1(DOUBLE a), expm1(DECIMAL a) DOUBLE Computes the exponential of the given value minus 1.
factorial(INT a) BIGINT Returns the factorial of a.
floor(DOUBLE a) BIGINT Returns the maximum BIGINT value that is equal to or less than a.
greatest(t v1, t v2, …) Returns the greatest value of the list of values
hex(BIGINT a), hex(STRING a), hex(BINARY a) STRING If the argument is an INT or binary, hex returns the number as a STRING in hexadecimal format. Otherwise if the number is a STRING, it converts each character into its hexadecimal representation and returns the resulting STRING.
hypot(INT A, INT B) DOUBLE Computes sqrt(a^2^ + b^2^) without intermediate overflow or underflow.
least(t v1, t v2, …) Returns the least value of the list of values
log(DOUBLE base, DOUBLE a) DOUBLE Returns the base-base logarithm of the argument a.
log10(DOUBLE a) DOUBLE Returns the base-10 logarithm of the argument a.
log1p(String columnName) DOUBLE Computes the natural logarithm of the given column plus one.
log2(DOUBLE a) DOUBLE Returns the base-2 logarithm of the argument a.
negate(INT a) INT/DOUBLE Returns -a
negate(DOUBLE a) INT/DOUBLE Returns -a
not(String) DOUBLE Inversion of boolean expression, i.e. NOT.
pow(DOUBLE a, DOUBLE p), power(DOUBLE a, DOUBLE p) DOUBLE Returns a^p.
rand(), rand(INT seed) DOUBLE Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed will make sure the generated random number sequence is deterministic.
rint(T type) DOUBLE Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
round(DOUBLE a) DOUBLE Returns the rounded BIGINT value of a.
round(DOUBLE a, INT d) DOUBLE Returns a rounded to d decimal places.
sin(DOUBLE a), sin(DECIMAL a) DOUBLE Returns the sine of a (a is in radians).
sinh(DOUBLE a), sinh(DECIMAL a) DOUBLE Returns the hyperbolic sine of a (a is in radians).
tan(DOUBLE a), tan(DECIMAL a) DOUBLE Returns the tangent of a (a is in radians).
tanh(DOUBLE a), tanh(DECIMAL a) DOUBLE Returns the hyperbolic tangent of a (a is in radians).
toDegrees(String columnName) DOUBLE Converts the angle measured in radians to an approximately equivalent angle measured in degrees.
toRadians(String columnName) DOUBLE Converts the angle measured in degrees to an approximately equivalent angle measured in radians.
unhex(STRING a) BINARY Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the byte representation of the number.
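
An illustrative use of a few of these mathematical functions (the column names are assumed for the example):

SELECT round(unit_price * quantity, 2) AS line_total,
       ceil(weight_kg) AS shipping_units,
       greatest(discount_a, discount_b) AS best_discount,
       abs(adjustment) AS adjustment_size
FROM {OrderLines}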

Aggregate Functions and Properties

Below are the Aggregate functions supported in Spark SQL :

Signature Return Type Description
MIN( expression ) INT/BIGINT /DOUBLE Finds the minimum value in a column given by  expression
MAX( expression ) INT/BIGINT /DOUBLE Finds the maximum value in a column given by  expression
AVG( expression ) INT/BIGINT /DOUBLE Computes the average value of a column given by  expression
SUM( expression ) INT/BIGINT /DOUBLE Computes the summation value of a column given by  expression
COUNT( expression ) INT Counts the rows defined by the  expression
COUNT (*) INT Counts all rows in the specified table or view
DENSE_RANK( value_list ) WITHIN GROUP (ORDER BY sort_list ) INT Generates a dense rank (no ranks are skipped) for a hypothetical row ( value_list ) in a group of rows generated by  GROUP BY
PERCENT_RANK( value_list ) WITHIN GROUP (ORDER BY sort_list ) INT Generates a relative rank for a hypothetical row by dividing that row’s rank less 1 by the number of rows in the group
RANK( value_list ) WITHIN GROUP (ORDER BY  sort_list ) INT Generates a rank for a hypothetical row ( value_list ) in a group of rows generated by  GROUP BY
approx_count_distinct(columnName: String, rsd: Double) INT/BIGINT/ DOUBLE Returns the approximate number of distinct items in a group.
countDistinct(columnName: String, columnNames: String*) INT Returns the number of distinct items in a group.
first(columnName: String) ALL Returns the first value of a column in a group.
grouping(columnName: String) BOOLEAN Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
last(columnName: String) ALL  Returns the last value of the column in a group.
mean(e: Column) INT/BIGINT/ DOUBLE Returns the average of the values in a group. Alias for avg
sumDistinct(columnName: String) INT Aggregate function: returns the sum of distinct values in the expression.
lag(e: Column, offset: Int, defaultValue: Any) ALL Returns the value that is  offset  rows before the current row, and  defaultValue  if there is less than  offset  rows before the current row. For example, an  offset  of one will return the previous row at any given point in the window partition.
lead(e: Column, offset: Int, defaultValue: Any) ALL Returns the value that is  offset  rows after the current row, and  defaultValue  if there is less than  offset  rows after the current row. For example, an  offset  of one will return the next row at any given point in the window partition.
row_number() INT Returns a sequential number starting at 1 within a window partition.
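
For example, a hypothetical Query task (the task and column names are illustrative only) could summarise orders per customer and rank customers within each region using a window function:

SELECT region,
       customer_id,
       COUNT(*) AS order_count,
       SUM(order_total) AS total_spend,
       row_number() OVER (PARTITION BY region ORDER BY SUM(order_total) DESC) AS spend_rank
FROM {Orders}
GROUP BY region, customer_id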

Miscellaneous Functions and Properties

Below are some more functions supported in Spark SQL :

Signature Return Type Description
crc32(string/binary) BIGINT Computes a cyclic redundancy check value for string or binary argument and returns bigint value. Example: crc32(‘ABC’) = 2743272264.
desc(String columnName) COLUMN Returns a sort expression based on the descending order of the column.
explode(Row) ROWS Creates a new row for each element with position in the given array or map column.
input_file_name FILE Creates a string column for the file name of the current Spark task.
json_tuple(jsonStr, k1, k2, …) TUPLE Creates a new row for a json column according to the given field names.
lit(literal) COL Creates a Column of literal value.
md5(string/binary) STRING Calculates an MD5 128-bit checksum for the string or binary. The value is returned as a string of 32 hex digits, or NULL if the argument was NULL. Example: md5(‘ABC’) = ‘902fbdd2b1df0c4f70b4a5d23525e932’.
monotonically_increasing_id COLUMN A column expression that generates monotonically increasing 64-bit integers.
sha1(string/binary) STRING Calculates the SHA-1 digest for string or binary and returns the value as a hex string, Example: sha1(‘ABC’) = ‘3c01bdbb26f358bab27f267924aa2c9a03fcfdb8’.
sha2(string/binary, int numBits) STRING Calculates the SHA-2 family of hash functions of a binary value and returns the value as a hex string. NumBits controls the number of bits in the message digest.
shiftLeft(int a, int b) INT Bitwise left shift. Shifts a b positions to the left.
shiftRight(int a, int b) INT Bitwise right shift. Shifts a b positions to the right.
shiftrightunsigned(int a, int b) INT Bitwise unsigned right shift.
signum(String columnName) COLUMN Computes the signum of the given column.
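
As a sketch (the column names are assumed), these functions are useful for generating checksums and surrogate keys within a Query task:

SELECT monotonically_increasing_id() AS row_id,
       md5(concat(first_name, last_name)) AS name_hash,
       crc32(email) AS email_crc
FROM {Contacts}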

Reserved Words in Spark SQL

ALL BETWEEN DISTINCT GROUP INTO
AND BY FALSE HAVING IS
APPROXIMATE COUNT FROM INNER LIMIT
AS CACHE FIRST IF JOIN
ASC CAST LAST IN LEFT
AVG DESC FULL INSERT MAX
MIN LIKE OUTER TABLE INTERSECT
NOT RLIKE RIGHT TIMESTAMP EXCEPT
NULL UPPER SELECT TRUE SUBSTR
ON LOWER SEMI UNCACHE SUBSTRING
OR REGEXP STRING UNION SQRT
OVERWRITE ORDER SUM WHERE ABS
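
If a column or task output uses one of these words as a name, it may need to be quoted in queries. A minimal sketch, assuming Spark SQL’s backtick quoting and illustrative columns named count and timestamp:

SELECT `count`, `timestamp`
FROM {EventSummary}
WHERE `count` > 0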

Licence

To get a valid licence, go to the ‘Connections‘ view, then navigate to ‘AWS Location‘; that is where you will find the Product ID and Licence Key details to be entered.

If your licence has expired, you can email the “Product ID” to the Bryte support team – support@bryteflow.com. In response, you will receive a valid licence key to be applied back to the product.

This is not required if you are using BryteFlow as an AMI from the AWS Marketplace.

The image below shows where to enter the licence key to activate the product again.

Release Notes

BryteFlow Blend is an ever-evolving product. We’ve been listening to user feedback and using it as a basis for improvements and additions.

Every year there are several major stable releases offering new features and fixes to Blend.

Our Release Notes contain the information you’ll need to keep updated on the latest improvements and additions. Stay up to date and get an overview of the new features included in the releases.

Latest Release

Version 2.3

Release date: 01.09.2018

What’s New in 2.3?

  • S3 Data View
  • Git Support
  • Run Now
  • Run Trace
  • User Login

For more information, see Release Notes Blend 2.3
