User Guide
  1. BryteFlow Ingest - Real-time data integration
    1. Supported Database Sources
    2. Supported Destinations
  2. BryteFlow Ingest Architecture
  3. Prerequisite
    1. Recommended Hardware configuration
    2. Prerequisites for software on the server
    3. Required Skills
  4. Launch BryteFlow Enterprise Edition from AWS Marketplace
  5. Launch BryteFlow Ingest from AWS Marketplace : Standard Edition
  6. Launch BryteFlow SAP Data Lake Builder from AWS Marketplace
  7. AWS Identity and Access Management (IAM) for BryteFlow
  8. Defining Roles for BryteFlow
  9. Environment Preparation
    1. Creating An EC2 System
    2. Recommended Network ACL Rules for EC2
    3. Outbound Connections
    4. Creating S3 Bucket
    5. Configuring EMR Cluster
    6. Additional AWS services
    7. Managing Access Keys
    8. Data Security and Encryption
    9. Testing The Connections
    10. MS SQL Server as a source connector
      1. Preparing MS SQL Server
      2. Security for MS SQL Server
      3. Verification of MS SQL Server source
      4. Data Types in MS SQL Server
    11. Oracle DB as a source connector
      1. Preparing Oracle on Amazon RDS
      2. Preparing On-premises Oracle
      3. Security for Oracle
      4. Verification of Oracle source
      5. Data Types in Oracle
    12. Preparing On-premises MySQL
    13. Preparing MySQL on Amazon RDS
    14. Security for MySQL
    15. Preparing MariaDB on Amazon RDS
    16. Preparing PostgreSQL DB
    17. Preparing Salesforce
    18. Preparing for BryteFlow Trigger solution
  10. Starting & Stopping BryteFlow Ingest
  11. Configuration of BryteFlow Ingest
  12. Dashboard
  13. Connections
    1. Source Database
      1. MS SQL Server DB
      2. Available source connectors for Oracle database
      3. Oracle DB
      4. MySQL
      5. Salesforce
      6. SAP HANA DB (All options)
      7. JDBC Full Extracts / Any Database (Full Extracts)
    2. Destination Database
      1. Load Kafka Deltas
      2. Load to Databricks using Amazon S3
    3. Destinations for Microsoft Azure cloud
      1. Load to Databricks using ADLS
    4. Destinations for Google Cloud Platform
    5. Destination File System
    6. Credentials
    7. Email Notification
  14. Data
    1. Column Type Change
    2. Partitioning
  15. Schedule
    1. Rollback
  16. Configuration
    1. Source Database
    2. Destination Database
    3. Credentials / Destination File System
    4. License
    5. High Availability / Recovery
      1. Recovery Configuration
      2. Recovery Utilisation
      3. Recovery from Faults
      4. Time to Recover
      5. Recovery Testing
    6. Recommended Risk Audit mechanisms
    7. Remote Monitoring
  17. Log
  18. Optimize usage of AWS resources / Save Cost
    1. Tagging AWS Resources
  19. Upgrade BryteFlow versions from AWS Marketplace when using AMI
  20. BryteFlow: Licencing Model
  21. BryteFlow Ingest : Pricing
  22. BryteFlow Support Information
  23. Appendix: Understanding Extraction Process
    1. Extraction Process
    2. First Extract
    3. Schedule Extract
    4. Add a new table to existing Extracts
    5. Resync data for Existing tables
  24. Appendix: Bryte Events for AWS CloudWatch Logs and SNS
  25. Appendix: Release Notes
    1. Bryteflow Ingest 3.11.4
    2. BryteFlow Ingest 3.11
    3. BryteFlow Ingest 3.10.1
    4. BryteFlow Ingest 3.10
    5. BryteFlow Ingest 3.10
    6. BryteFlow Ingest 3.9.3
    7. BryteFlow Ingest 3.9
    8. BryteFlow Ingest 3.8
    9. BryteFlow Ingest - v3.7.3
    10. BryteFlow Ingest 3.7

BryteFlow Ingest - Real-time data integration

BryteFlow Ingest is real-time data replication software that replicates data from various sources to destinations. It is high-performance software that facilitates real-time change data capture from sources with zero load on the source systems. BryteFlow Ingest captures the changes and transfers them to the target system. It automates the creation of either an exact copy or a time-series copy of the data source in the target. BryteFlow Ingest performs an initial full load from the source and then incrementally merges changes to the destination of choice; the entire process is fully automated.

BryteFlow Ingest works with its companion products, which are part of the BryteFlow product suite.

Supported Database Sources

BryteFlow Ingest supports the following database sources:

  • MS SQL Server
  • Oracle
  • MySQL
  • PostgreSQL
  • MariaDB
  • Salesforce
  • SAP S/4 HANA
  • SAP ECC
  • SAP HANA
  • API integration with files on Amazon S3
  • Any Files on Amazon S3
If you wish to source data from a source not mentioned above, please contact Bryte directly at info@bryteflow.com.

Supported Destinations

The supported destinations are as follows:

  • S3
  • Redshift
  • Snowflake
  • Athena
  • Apache Kafka
  • ADLS Gen2
  • Azure Synapse SQL
  • Microsoft SQL Server
  • Azure SQL DB
  • Oracle
  • PostgreSQL
  • Google BigQuery
  • Databricks

Looking for a different destination?

BryteFlow builds custom source/destination connectors on customer request; please contact us directly at info@bryteflow.com.

BryteFlow Ingest Architecture

BryteFlow Ingest can replicate data from any database, any API and any flat file to Amazon S3, Redshift, Snowflake, Databricks, PostgreSQL, Google BigQuery, Apache Kafka, etc. through a simple point-and-click interface. It is an entirely self-service and automated data replication tool.

BryteFlow offers various deployment strategies to its customers, as listed below:

  • Standard deployment on AWS Environment
  • High Availability deployment in an AWS Environment
  • Hybrid deployment – using on-premises and cloud infrastructure

BryteFlow Ingest uses log-based Change Data Capture for data replication. Below is the technical architecture diagram showing a standard setup in an AWS environment.

Below is the architecture diagram for BryteFlow Ingest in a standard deployment. It is the reference architecture for the setup instructions provided in this user guide. For more details on setting up any optional components, please contact BryteFlow support.

Estimated deployment time: approximately 1 hour

 

 

Below is the BryteFlow Ingest architecture in a standard deployment, showing integration with various optional AWS services.

The above architecture diagram describes a standard deployment and shows the following:

  • AWS services running alongside BryteFlow Ingest
  • The BryteFlow architecture recommended for a VPC in AWS
  • Data flow between source, AWS and destination, with the security and monitoring features used
  • Security, including IAM, in a separate group interfaced with BryteFlow Ingest
  • All supported destinations and AWS services with which BryteFlow integrates

High Availability Architecture

Estimated deployment time: approximately 1 day

The high availability architecture shows how BryteFlow is deployed in a multi-AZ setup. In case of any instance or AZ failure, it can be auto-scaled in another AZ without incurring any data loss.

Hybrid Architecture

Estimated deployment time: approximately 4 hours

BryteFlow also offers a hybrid deployment model to its customers, which is a mix of services on-premises and in the AWS Cloud. BryteFlow Ingest can be easily set up on a Windows server in an on-premises environment, while all the destination endpoints reside in the AWS Cloud, making it a hybrid model. It is recommended to use secure connectivity between on-premises and AWS services, which can be achieved using a VPN connection or AWS Direct Connect; refer to the blog on choices for hybrid cloud connectivity.

Prerequisite

Prerequisites of using Amazon Machine Image (AMI) from AWS Marketplace

Using the AMI sourced from the AWS Marketplace requires:

  • Selection of BryteFlow Ingest volume
  • Selection of EC2 instance type
  • Ensure connectivity between the server/EC2 hosting the BryteFlow Ingest software and
    • the source
    • Amazon S3
    • Amazon EMR
    • Amazon Redshift (if needed as a destination)
    • Snowflake (if needed as a destination)
    • Amazon Athena (if needed as a destination)
    • DynamoDB (if high availability option is required)

The steps to create the AWS services are described in detail in the section ‘Environment Preparation’.

Follow the steps below prior to launching BryteFlow in AWS via the AMI or a custom install on an EC2:

  1. Create a policy with a relevant name for EC2, e.g. “BryteFlowEc2Policy”. Refer to the AWS guide on creating policies.
  2. Use the policy JSON provided in the section “AWS Identity and Access Management (IAM) for BryteFlow” below.
  3. Create an IAM role “BryteFlowEc2Role”. Refer to the AWS guide for step-by-step instructions on creating roles.
  4. Attach the policy “BryteFlowEc2Policy” to the role.
  5. Similarly, create a Lambda policy, which is required for disk checks, and attach the Lambda policy JSON provided in the section “Recovery from Faults” below.
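
As a minimal sketch of steps 1–4 using the AWS CLI (it assumes the policy JSON from the IAM section has been saved locally as bryteflow-ec2-policy.json and a standard EC2 trust policy as ec2-trust.json; the file names and <account_id> are illustrative placeholders):

# Create the customer managed policy from the JSON in the IAM section below
aws iam create-policy --policy-name BryteFlowEc2Policy --policy-document file://bryteflow-ec2-policy.json

# Create the role with a trust policy that lets EC2 assume it
aws iam create-role --role-name BryteFlowEc2Role --assume-role-policy-document file://ec2-trust.json

# Attach the policy to the role
aws iam attach-role-policy --role-name BryteFlowEc2Role --policy-arn arn:aws:iam::<account_id>:policy/BryteFlowEc2Policy

# Expose the role to the EC2 instance through an instance profile
aws iam create-instance-profile --instance-profile-name BryteFlowEc2Role
aws iam add-role-to-instance-profile --instance-profile-name BryteFlowEc2Role --role-name BryteFlowEc2Role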

The available AMI options are volume-based; recommended EC2 and EMR options for each volume tier are listed below.

 

Total Data Volume    EC2 Recommended    EMR Recommended
< 100 GB             t2.small           1 x m4.xlarge master node, 2 x c5.xlarge task nodes
100 GB – 300 GB      t2.medium          1 x m4.xlarge master node, 2 x c5.xlarge task nodes
300 GB – 1 TB        t2.large           1 x m4.xlarge master node, 2 x c5.xlarge task nodes
> 1 TB               Seek expert advice from support@bryteflow.com

 

NOTE: Evaluate the EMR configuration depending on the latency required.

These should be considered a starting point; if you have any questions, please seek expert advice from support@bryteflow.com.

 

System Requirement when not using Amazon Machine Image (AMI)

  • Port 8081 should be open on the server hosting the BryteFlow Ingest software
  • Google Chrome browser is required as the internet browser on the server hosting BryteFlow Ingest software
  • Java version 8 or higher is required
  • If using MS SQL Server as a source, please download and install the BCP utility
  • Ensure connectivity between the server hosting the BryteFlow Ingest software and the source, Amazon S3, Amazon EMR, Amazon Redshift and DynamoDB (if high availability option is required)

Prerequisites for software on the server

The following software is required to be installed on the server:

  1. The server should have 64-bit OpenJDK 1.8:
    https://corretto.aws/downloads/latest/amazon-corretto-8-x64-windows-jre.msi
  2. Google Chrome
  3. For Oracle database as a destination ONLY:
    • Please install the Oracle client corresponding to the version of Oracle at the destination database.
  4. For SQL Server sources and SQL Server destinations ONLY: install the bcp Microsoft utility and the drivers below:

Required Skills

BryteFlow is a robust application that makes data replication to the cloud easy and smooth. It can deal with huge data volumes with ease, and the process is fully automated. The setup is done in three easy steps. It doesn’t need highly technical resources; basic knowledge of the following is recommended to deploy the software:

  • AWS Cloud Fundamentals
  • For RDBMS endpoints, basic database skills, including writing and executing database queries.
  • Ability to use a Microsoft Windows system

 

Launch BryteFlow Enterprise Edition from AWS Marketplace

Steps to launch BryteFlow from AWS Marketplace: Enterprise Edition

  • Please complete the ‘Environment Preparation’ section before proceeding to launch BryteFlow from an AMI.
  • Go to the product URL https://aws.amazon.com/marketplace/pp/B079PWMJ4B
  • Click ‘Continue to Subscribe’
  • Click ‘Continue to Configuration’. This brings up the default ‘Fulfillment Option’ with the latest software version.
  • Choose the AWS Region you would like to use, or keep the default AWS Region already selected in the drop-down.

Supported AWS Regions:  

BryteFlow Ingest is validated and supported in the AWS Regions below; however, it can be launched in all AWS Regions.

  • us-east-1
  • us-east-2
  • us-west-2
  • ap-southeast-2

BryteFlow is available in ALL AWS Regions. 

Please contact BryteFlow Support if you need any assistance.

 

  • Click ‘Continue to Launch’
  • Choose Action ‘Launch from Website’
  • Select your EC2 instance type based on your data volume, recommendations available in the product detail page
  • Choose your VPC from the dropdown or go by the default
  • Please select the ‘Private Subnet’ under ‘Subnet Settings’. If none exists, it is recommended to create one; please follow the detailed AWS User Guide for Creating a Subnet.
  • Update the ‘Security Group Settings’ or create a security group based on the BryteFlow recommended steps below (a CLI sketch is shown after this list):
    • Assign a name for the security group, e.g. BryteFlowIngest
    • Enter a description of your choice
    • Add inbound rule(s) to allow RDP to the EC2 instance from your custom IP address
    • Add outbound rule(s) to allow the EC2 instance to access the source database. DB ports vary based on the source database; please add rules allowing the instance access to the specific source database ports.
    • For more details, refer to the BryteFlow recommendation on Network ACLs for your VPC in the section ‘Recommended Network ACL Rules’ below
  • Provide the Key Pair Settings by choosing an EC2 key pair of your own or use ‘Create a key pair in EC2’
  • Click ‘Launch’ to launch the EC2 instance.
  • The endpoint will be an EC2 instance running BryteFlow Ingest (as a Windows service) on port 8081
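
As a minimal sketch of the security group step using the AWS CLI (the VPC id, security group id, workstation IP and destination CIDR are hypothetical placeholders; port 1433 is shown assuming a SQL Server source, so adjust the port to your source database):

# Create the security group in your VPC
aws ec2 create-security-group --group-name BryteFlowIngest --description "BryteFlow Ingest security group" --vpc-id <vpc_id>

# Inbound: allow RDP to the EC2 instance from your custom IP only (use the group id returned above)
aws ec2 authorize-security-group-ingress --group-id <security_group_id> --protocol tcp --port 3389 --cidr <your_ip>/32

# Outbound: allow the instance to reach the source database port
aws ec2 authorize-security-group-egress --group-id <security_group_id> --protocol tcp --port 1433 --cidr <source_db_cidr>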

Additional information regarding launching an EC2 instance can be found here

If you have any trouble launching or connecting to the EC2 instance, please refer to the troubleshooting guides below:

 

** Please note that BryteFlow Blend is a companion product to BryteFlow Ingest. To make the most of the enterprise capabilities, first set up BryteFlow Ingest completely. Thereafter, no configuration is required in BryteFlow Blend; it is ready to go. Start with the transformations directly off AWS S3.

Once connected to the EC2 instance:

  • Launch BryteFlow Ingest from the Google Chrome browser using the bookmark ‘BryteFlow Ingest’
  • Or type localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console
  • This will bring up a page requesting either a ‘New Instance’ or an ‘Existing Instance’
    • Click on the ‘New Instance’ button and do the setup for your environment (refer to the section regarding Configuration Of BryteFlow Ingest in this document for further details)
    • ‘Existing Instance’ should only be clicked when recovering an instance of BryteFlow Ingest (refer to the Recovery section of this document for further details)
    • Once Ingest is fully set up and is replicating to the desired destination successfully:
    • Launch BryteFlow Blend from the Google Chrome browser using the bookmark ‘BryteFlow Blend’
    • Or type localhost:8082 into the Google Chrome browser to open the BryteFlow Blend web console
    • BryteFlow Blend is tied to BryteFlow Ingest and no AWS Location configuration is required.
    • This makes users ready to start their data transformations off S3.
    • For details on Blend setup and Usage refer to the BryteFlow Blend User Guide: https://docs.bryteflow.com/Bryteflow-Blend-User-Guide/

 

 

Launch BryteFlow Ingest from AWS Marketplace : Standard Edition

Steps to launch BryteFlow Ingest from AWS Marketplace: Standard Edition

  • Please complete the ‘Environment Preparation’ section before proceeding to launch BryteFlow from an AMI.
  • Go to the product URL https://aws.amazon.com/marketplace/pp/B01MRLEJTK
  • Click ‘Continue to Subscribe’
  • Click ‘Continue to Configuration’. This brings up the default ‘Fulfillment Option’ with the latest software version.
  • Choose the AWS Region you would like to use, or keep the default AWS Region already selected in the dropdown.

BryteFlow is available in ALL AWS Regions.

  • Click ‘Continue to Launch’
  • Choose Action ‘Launch from Website’
  • Select your EC2 instance type based on your data volume, recommendations available in the product detail page
  • Choose your VPC from the dropdown
  • Please select the ‘Private Subnet’ under ‘Subnet Settings’. If none exists, it is recommended to create one; please follow the detailed AWS User Guide for Creating a Subnet.
  • Update the ‘Security Group Settings’ or create a security group based on the BryteFlow recommended steps below:
    • Assign a name for the security group, e.g. BryteFlowIngest
    • Enter a description of your choice
    • Add inbound rule(s) to allow RDP to the EC2 instance from your custom IP address
    • Add outbound rule(s) to allow the EC2 instance to access the source database. DB ports vary based on the source database; please add rules allowing the instance access to the specific source database ports.
    • For more details, refer to the BryteFlow recommendation on Network ACLs for your VPC in the section ‘Recommended Network ACL Rules’ below
  • Provide the Key Pair Settings by choosing an EC2 key pair of your own or use ‘Create a key pair in EC2’
  • Click ‘Launch’ to launch the EC2 instance.
  • The endpoint will be an EC2 instance running BryteFlow Ingest (as a Windows service) on port 8081

Additional information regarding launching an EC2 instance can be found here

If you have any trouble launching or connecting to the EC2 instance, please refer to the troubleshooting guides below:

Once connected to the EC2 instance:

  • Launch BryteFlow Ingest from the Google Chrome browser using the bookmark ‘BryteFlow Ingest’
  • Or type localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console
  • This will bring up a page requesting either a ‘New Instance’ or an ‘Existing Instance’
    • Click on the ‘New Instance’ button (refer to the section regarding Configuration Of BryteFlow Ingest in this document for further details)
    • ‘Existing Instance’ should only be clicked when recovering an instance of BryteFlow Ingest (refer to the Recovery section of this document for further details)

Launch BryteFlow SAP Data Lake Builder from AWS Marketplace

Steps to launch BryteFlow from AWS Marketplace: SAP Data Lake Builder

  • Please complete the ‘Environment Preparation’ section before proceeding to launch BryteFlow from an AMI.
  • Go to the product URL
  • Click ‘Continue to Subscribe’
  • Click ‘Continue to Configuration’. This brings up the default ‘Fulfillment Option’ with the latest software version.
  • Choose the AWS Region you would like to use, or keep the default AWS Region already selected in the drop-down.
    • Click ‘Continue to Launch’
    • Choose Action ‘Launch from Website’
    • Select your EC2 instance type based on your data volume, recommendations available in the product detail page
    • Choose your VPC from the dropdown
    • Please select the ‘Private Subnet’ under ‘Subnet Settings’. If none exists, it is recommended to create one; please follow the detailed AWS User Guide for Creating a Subnet.
    • Update the ‘Security Group Settings’ or create a security group based on the BryteFlow recommended steps below:
      • Assign a name for the security group, e.g. BryteFlowIngest
      • Enter a description of your choice
      • Add inbound rule(s) to allow RDP to the EC2 instance from your custom IP address
      • Add outbound rule(s) to allow the EC2 instance to access the source database. DB ports vary based on the source database; please add rules allowing the instance access to the specific source database ports.
      • For more details, refer to the BryteFlow recommendation on Network ACLs for your VPC in the section ‘Recommended Network ACL Rules’ below
    • Provide the Key Pair Settings by choosing an EC2 key pair of your own or use ‘Create a key pair in EC2’
    • Click ‘Launch’ to launch the EC2 instance.
    • The endpoint will be an EC2 instance running BryteFlow Ingest (as a Windows service) on port 8081

    Additional information regarding launching an EC2 instance can be found here

    If you have any trouble launching or connecting to the EC2 instance, please refer to the troubleshooting guides below:

    Once connected to the EC2 instance:

    • Launch BryteFlow Ingest from the Google Chrome browser using the bookmark ‘BryteFlow Ingest’
    • Or type localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console
    • This will bring up a page requesting either a ‘New Instance’ or an ‘Existing Instance’
      • Click on the ‘New Instance’ button (refer to the section regarding Configuration Of BryteFlow Ingest in this document for further details)
      • ‘Existing Instance’ should only be clicked when recovering an instance of BryteFlow Ingest (refer to the Recovery section of this document for further details)

     

 

AWS Identity and Access Management (IAM) for BryteFlow

AWS IAM roles are used to delegate access to the AWS resources. With IAM roles, you can establish trust relationships between your trusting account and other AWS trusted accounts. The trusting account owns the resource to be accessed and the trusted account contains the users who need access to the resource.

BryteFlow’s Recommendations: 

  • Create an IAM user, e.g. ‘BryteFlow_User’. Please DO NOT use the root user account to set up the application. Refer to the AWS guide on how to create an IAM user.
  • Create an IAM role, e.g. ‘BryteFlow_EC2Role’. Refer to the AWS guide on creating an IAM role.
  • Create an IAM policy, e.g. ‘BryteFlow_policy’, and assign the custom policies provided below to the EC2 role; for details on creating a policy, click here.
  • Instead of defining permissions for individual BryteFlow IAM users, it’s usually more convenient to create groups that relate to job functions (administrators, developers, accounting, etc.). Next, define the relevant permissions for each group. Finally, assign IAM users to those groups. All the users in an IAM group inherit the permissions assigned to the group. That way, you can make changes for everyone in a group in just one place.
  • Grant least privilege – it is recommended to grant only the minimal required permissions to the IAM role. The BryteFlow user requires basic permissions on S3, CloudWatch, DynamoDB and Redshift (if needed).
  • BryteFlow needs access to S3, EC2, EMR, SNS and Redshift (if needed as a destination) with the minimum privileges listed below.
        • A sample policy required for the ‘BryteFlow’ EC2 role is shared below:

          {
          "Version": "2012-10-17",
          "Statement": [
          {
          "Sid": "1",
          "Action": [
          "s3:DeleteObject",
          "s3:GetObject",
          "s3:ListBucket",
          "s3:PutObject"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::"
          },
          {
          "Sid": "2",
          "Action": [
          "ec2:AcceptVpcEndpointConnections",
          "ec2:AcceptVpcPeeringConnection",
          "ec2:AssociateIamInstanceProfile",
          "ec2:CreateTags",
          "ec2:DescribeTags",
          "ec2:RebootInstances"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:ec2:"
          },
          {
          "Sid": "3",
          "Action": [
          "elasticmapreduce:AddJobFlowSteps",
          "elasticmapreduce:DescribeStep",
          "elasticmapreduce:ListSteps",
          "elasticmapreduce:RunJobFlow",
          "elasticmapreduce:ListCluster",
          "elasticmapreduce:DescribeCluster"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:elasticmapreduce:::/"
          },
          {
          "Sid": "4",
          "Action": [
          "sns:Publish"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:sns:::"
          },
          {
          "Sid": "5",
          "Action": [
          "redshift:ExecuteQuery",
          "redshift:FetchResults",
          "redshift:ListTables"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:redshift:::cluster:mycluster*"
          },
          {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
          "dynamodb:CreateTable",
          "dynamodb:PutItem",
          "dynamodb:Update*",
          "dynamodb:Get*",
          "dynamodb:Scan"
          ],
          "Resource": "arn:aws:dynamodb:::table/BryteflowTable"
          },
          {
          "Sid": "6",
          "Effect": "Allow",
          "Action": [
          "secretsmanager:GetSecretValue",
          "secretsmanager:DescribeSecret",
          "secretsmanager:PutSecretValue",
          "secretsmanager:UpdateSecret"
          ],
          "Resource": "arn:aws:secretsmanager:*::secret:*"
          },
          {
          "Sid": "7",
          "Effect": "Allow",
          "Action": "secretsmanager:ListSecrets",
          "Resource": "*"
          }
          ]
          }

  • The resources used in the policy are as follows:
    • bucket_name: the data lake bucket name; the destination S3 bucket where the user wants the replicated data to be written.
    • ec2_instance_id: the instance id of the EC2 launched via the AMI, or of the existing EC2 where BryteFlow is set up.
    • region: the region in which the EMR cluster is created.
    • account/account_id: the ‘BryteFlow’ user account/account id created for BryteFlow Ingest.
    • resourceType: the EMR resource type, which should be ‘Cluster’.
    • resource_id: the EMR cluster id for BryteFlow.
    • sns_name: the SNS topic name used by BryteFlow.
    • relative-id: the Redshift cluster identifier used in BryteFlow.
  • Please NOTE: the Redshift and SNS policies are optional. If Redshift is not a preferred destination for ingestion, or SNS alerts are not required, please ignore the respective policy section when implementing.
  • For more details on setting up IAM roles and policies refer to AWS documentation : https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-set-up.html

Defining Roles for BryteFlow

Below are the various roles and permissions needed for launching and managing the BryteFlow application.

Role: EC2Admin (AWS custom role for EC2)
Purpose: Create and manage the EC2 instance
Permissions/Policies:
EC2: List - DescribeInstanceStatus
Directory Service: List, Write - DescribeDirectories, CreateComputer
Systems Manager: List, Read, Write -
ListAssociations,
ListInstanceAssociations,
DescribeAssociation,
DescribeDocument,
GetDeployablePatchSnapshotForInstance,
GetDocument,
GetManifest,
GetParameters,
PutComplianceItems,
PutInventory,
UpdateAssociationStatus,
UpdateInstanceAssociationStatus,
UpdateInstanceInformation

Role: DBAdmin (AWS custom role)
Purpose: Manage DB access and privileges
Permissions/Policies:
cloudwatch:DeleteAlarms,
cloudwatch:Describe*,
cloudwatch:DisableAlarmActions,
cloudwatch:EnableAlarmActions,
cloudwatch:Get*,
cloudwatch:List*,
cloudwatch:PutMetricAlarm,
datapipeline:ActivatePipeline,
datapipeline:CreatePipeline,
datapipeline:DeletePipeline,
datapipeline:DescribeObjects,
datapipeline:DescribePipelines,
datapipeline:GetPipelineDefinition,
datapipeline:ListPipelines,
datapipeline:PutPipelineDefinition,
datapipeline:QueryObjects,
dynamodb:CreateTable,
dynamodb:BatchGetItem,
dynamodb:BatchWriteItem,
dynamodb:ConditionCheckItem,
dynamodb:PutItem,
dynamodb:DescribeTable,
dynamodb:DeleteItem,
dynamodb:GetItem,
dynamodb:Scan,
dynamodb:Query,
dynamodb:UpdateItem,
ec2:DescribeAccountAttributes,
ec2:DescribeAddresses,
ec2:DescribeAvailabilityZones,
ec2:DescribeInternetGateways,
ec2:DescribeSecurityGroups,
ec2:DescribeSubnets,
ec2:DescribeVpcs,
iam:ListRoles,
iam:GetRole,
kms:ListKeys,
lambda:CreateEventSourceMapping,
lambda:CreateFunction,
lambda:DeleteEventSourceMapping,
lambda:DeleteFunction,
lambda:GetFunctionConfiguration,
lambda:ListEventSourceMappings,
lambda:ListFunctions,
logs:DescribeLogGroups,
logs:DescribeLogStreams,
logs:FilterLogEvents,
logs:GetLogEvents,
logs:Create*,
logs:PutLogEvents,
logs:PutMetricFilter,
rds:*,
redshift:CreateCluster,
redshift:DeleteCluster,
redshift:ModifyCluster,
redshift:RebootCluster,
s3:CreateBucket,
sns:CreateTopic,
sns:DeleteTopic,
sns:Get*,
sns:List*,
sns:SetTopicAttributes,
sns:Subscribe,
sns:Unsubscribe

Role: NetworkAdmin (custom role)
Purpose: Manage network access and firewall settings
Permissions/Policies:
autoscaling:Describe*,
cloudfront:ListDistributions,
cloudwatch:DeleteAlarms,
cloudwatch:DescribeAlarms,
cloudwatch:GetMetricStatistics,
cloudwatch:PutMetricAlarm,
directconnect:*,
ec2:AcceptVpcEndpointConnections,
ec2:AllocateAddress,
ec2:AssignIpv6Addresses,
ec2:AssignPrivateIpAddresses,
ec2:AssociateAddress,
ec2:AssociateDhcpOptions,
ec2:AssociateRouteTable,
ec2:AssociateSubnetCidrBlock,
ec2:AssociateVpcCidrBlock,
ec2:AttachInternetGateway,
ec2:AttachNetworkInterface,
ec2:AttachVpnGateway,
ec2:CreateCarrierGateway,
ec2:CreateCustomerGateway,
ec2:CreateDefaultSubnet,
ec2:CreateDefaultVpc,
ec2:CreateDhcpOptions,
ec2:CreateEgressOnlyInternetGateway,
ec2:CreateFlowLogs,
ec2:CreateInternetGateway,
ec2:CreateNatGateway,
ec2:CreateNetworkAcl,
ec2:CreateNetworkAclEntry,
ec2:CreateNetworkInterface,
ec2:CreateNetworkInterfacePermission,
ec2:CreatePlacementGroup,
ec2:CreateRoute,
ec2:CreateRouteTable,
ec2:CreateSecurityGroup,
ec2:CreateSubnet,
ec2:CreateTags,
ec2:CreateVpc,
ec2:CreateVpcEndpoint,
ec2:CreateVpcEndpointConnectionNotification,
ec2:CreateVpcEndpointServiceConfiguration,
ec2:CreateVpnConnection,
ec2:CreateVpnConnectionRoute,
ec2:CreateVpnGateway,
ec2:DeleteCarrierGateway,
ec2:DeleteEgressOnlyInternetGateway,
ec2:DeleteFlowLogs,
ec2:DeleteNatGateway,
ec2:DeleteNetworkInterface,
ec2:DeleteNetworkInterfacePermission,
ec2:DeletePlacementGroup,
ec2:DeleteSubnet,
ec2:DeleteTags,
ec2:DeleteVpc,
ec2:DeleteVpcEndpointConnectionNotifications,
ec2:DeleteVpcEndpointServiceConfigurations,
ec2:DeleteVpcEndpoints,
ec2:DeleteVpnConnection,
ec2:DeleteVpnConnectionRoute,
ec2:DeleteVpnGateway,
ec2:DescribeAccountAttributes,
ec2:DescribeAddresses,
ec2:DescribeAvailabilityZones,
ec2:DescribeCarrierGateways,
ec2:DescribeClassicLinkInstances,
ec2:DescribeCustomerGateways,
ec2:DescribeDhcpOptions,
ec2:DescribeEgressOnlyInternetGateways,
ec2:DescribeFlowLogs,
ec2:DescribeInstances,
ec2:DescribeInternetGateways,
ec2:DescribeKeyPairs,
ec2:DescribeMovingAddresses,
ec2:DescribeNatGateways,
ec2:DescribeNetworkAcls,
ec2:DescribeNetworkInterfaceAttribute,
ec2:DescribeNetworkInterfacePermissions,
ec2:DescribeNetworkInterfaces,
ec2:DescribePlacementGroups,
ec2:DescribePrefixLists,
ec2:DescribeRouteTables,
ec2:DescribeSecurityGroupReferences,
ec2:DescribeSecurityGroupRules,
ec2:DescribeSecurityGroups,
ec2:DescribeStaleSecurityGroups,
ec2:DescribeSubnets,
ec2:DescribeTags,
ec2:DescribeVpcAttribute,
ec2:DescribeVpcClassicLink,
ec2:DescribeVpcClassicLinkDnsSupport,
ec2:DescribeVpcEndpointConnectionNotifications,
ec2:DescribeVpcEndpointConnections,
ec2:DescribeVpcEndpointServiceConfigurations,
ec2:DescribeVpcEndpointServicePermissions,
ec2:DescribeVpcEndpointServices,
ec2:DescribeVpcEndpoints,
ec2:DescribeVpcPeeringConnections,
ec2:DescribeVpcs,
ec2:DescribeVpnConnections,
ec2:DescribeVpnGateways,
ec2:DescribePublicIpv4Pools,
ec2:DescribeIpv6Pools,
ec2:DetachInternetGateway,
ec2:DetachNetworkInterface,
ec2:DetachVpnGateway,
ec2:DisableVgwRoutePropagation,
ec2:DisableVpcClassicLinkDnsSupport,
ec2:DisassociateAddress,
ec2:DisassociateRouteTable,
ec2:DisassociateSubnetCidrBlock,
ec2:DisassociateVpcCidrBlock,
ec2:EnableVgwRoutePropagation,
ec2:EnableVpcClassicLinkDnsSupport,
ec2:ModifyNetworkInterfaceAttribute,
ec2:ModifySecurityGroupRules,
ec2:ModifySubnetAttribute,
ec2:ModifyVpcAttribute,
ec2:ModifyVpcEndpoint,
ec2:ModifyVpcEndpointConnectionNotification,
ec2:ModifyVpcEndpointServiceConfiguration,
ec2:ModifyVpcEndpointServicePermissions,
ec2:ModifyVpcPeeringConnectionOptions,
ec2:ModifyVpcTenancy,
ec2:MoveAddressToVpc,
ec2:RejectVpcEndpointConnections,
ec2:ReleaseAddress,
ec2:ReplaceNetworkAclAssociation,
ec2:ReplaceNetworkAclEntry,
ec2:ReplaceRoute,
ec2:ReplaceRouteTableAssociation,
ec2:ResetNetworkInterfaceAttribute,
ec2:RestoreAddressToClassic,
ec2:UnassignIpv6Addresses,
ec2:UnassignPrivateIpAddresses,
ec2:UpdateSecurityGroupRuleDescriptionsEgress,
ec2:UpdateSecurityGroupRuleDescriptionsIngress,
elasticbeanstalk:Describe*,
elasticbeanstalk:List*,
elasticbeanstalk:RequestEnvironmentInfo,
elasticbeanstalk:RetrieveEnvironmentInfo,
elasticloadbalancing:*,
logs:DescribeLogGroups,
logs:DescribeLogStreams,
logs:GetLogEvents,
route53:*,
route53domains:*,
sns:CreateTopic,
sns:ListSubscriptionsByTopic,
sns:ListTopics,
ec2:AcceptVpcPeeringConnection,
ec2:AttachClassicLinkVpc,
ec2:AuthorizeSecurityGroupEgress,
ec2:AuthorizeSecurityGroupIngress,
ec2:CreateVpcPeeringConnection,
ec2:DeleteCustomerGateway,
ec2:DeleteDhcpOptions,
ec2:DeleteInternetGateway,
ec2:DeleteNetworkAcl,
ec2:DeleteNetworkAclEntry,
ec2:DeleteRoute,
ec2:DeleteRouteTable,
ec2:DeleteSecurityGroup,
ec2:DeleteVolume,
ec2:DeleteVpcPeeringConnection,
ec2:DetachClassicLinkVpc,
ec2:DisableVpcClassicLink,
ec2:EnableVpcClassicLink,
ec2:GetConsoleScreenshot,
ec2:RejectVpcPeeringConnection,
ec2:RevokeSecurityGroupEgress,
ec2:RevokeSecurityGroupIngress,
ec2:CreateLocalGatewayRoute,
ec2:CreateLocalGatewayRouteTableVpcAssociation,
ec2:DeleteLocalGatewayRoute,
ec2:DeleteLocalGatewayRouteTableVpcAssociation,
ec2:DescribeLocalGatewayRouteTableVirtualInterfaceGroupAssociations,
ec2:DescribeLocalGatewayRouteTableVpcAssociations,
ec2:DescribeLocalGatewayRouteTables,
ec2:DescribeLocalGatewayVirtualInterfaceGroups,
ec2:DescribeLocalGatewayVirtualInterfaces,
ec2:DescribeLocalGateways,
ec2:SearchLocalGatewayRoutes,
s3:GetBucketLocation,
s3:GetBucketWebsite,
s3:ListBucket,
iam:GetRole,
iam:ListRoles,
iam:PassRole,
ec2:AcceptTransitGatewayVpcAttachment,
ec2:AssociateTransitGatewayRouteTable,
ec2:CreateTransitGateway,
ec2:CreateTransitGatewayRoute,
ec2:CreateTransitGatewayRouteTable,
ec2:CreateTransitGatewayVpcAttachment,
ec2:DeleteTransitGateway,
ec2:DeleteTransitGatewayRoute,
ec2:DeleteTransitGatewayRouteTable,
ec2:DeleteTransitGatewayVpcAttachment,
ec2:DescribeTransitGatewayAttachments,
ec2:DescribeTransitGatewayRouteTables,
ec2:DescribeTransitGatewayVpcAttachments,
ec2:DescribeTransitGateways,
ec2:DisableTransitGatewayRouteTablePropagation,
ec2:DisassociateTransitGatewayRouteTable,
ec2:EnableTransitGatewayRouteTablePropagation,
ec2:ExportTransitGatewayRoutes,
ec2:GetTransitGatewayAttachmentPropagations,
ec2:GetTransitGatewayRouteTableAssociations,
ec2:GetTransitGatewayRouteTablePropagations,
ec2:ModifyTransitGateway,
ec2:ModifyTransitGatewayVpcAttachment,
ec2:RejectTransitGatewayVpcAttachment,
ec2:ReplaceTransitGatewayRoute,
ec2:SearchTransitGatewayRoutes

Role: BryteFlowAdmin (custom role)
Purpose: Manage BryteFlow configurations
Permissions/Policies:
elasticmapreduce:ListClusters,
glue:GetDatabase,
athena:StartQueryExecution,
athena:ListDatabases,
glue:GetPartitions,
glue:UpdateTable,
athena:GetQueryResults,
athena:GetDatabase,
glue:GetTable,
athena:StartQueryExecution,
glue:CreateTable,
glue:GetPartitions,
elasticmapreduce:ListSteps,
athena:GetQueryResults,
s3:ListBucket,
elasticmapreduce:DescribeCluster,
glue:GetTable,
glue:GetDatabase,
s3:PutObject,
s3:GetObject,
elasticmapreduce:DescribeStep,
athena:StopQueryExecution,
athena:GetQueryExecution,
s3:DeleteObject,
elasticmapreduce:AddJobFlowSteps,
s3:GetBucketLocation,
s3:PutObjectAcl,
secretsmanager:GetSecretValue,
secretsmanager:DescribeSecret,
secretsmanager:PutSecretValue,
secretsmanager:UpdateSecret

Policy: Amazon S3 (resource-based policy)
Purpose: To manage bucket-level permissions, a resource-based policy should be applied to the S3 bucket to restrict bucket-level access. The policy is attached to the bucket, but it controls access to both the bucket and the objects in it.
Permissions/Policies:
s3:PutObject,
s3:GetObject,
s3:DeleteObject,
s3:GetBucketLocation,
s3:PutObjectAcl
Resource:
arn:aws:s3:::<bucket-name>,
arn:aws:s3:::<bucket-name>/*

Policy: Amazon EC2 (resource-based policy)
Purpose: To manage instance-level permissions, a resource-based policy should be applied to restrict access to the EC2 instance.
Permissions/Policies:
ec2:AcceptVpcEndpointConnections,
ec2:AcceptVpcPeeringConnection,
ec2:AssociateIamInstanceProfile,
ec2:CreateTags,
ec2:DescribeTags,
ec2:RebootInstances
Resource: arn:aws:ec2:<ec2_instance_id>

Policy: AWS Marketplace (AWS managed policy)
Purpose: For a user to launch BryteFlow from AWS Marketplace, the ‘AWSMarketplaceManageSubscriptions’ policy should be attached.
Permissions/Policies:
aws-marketplace:ViewSubscriptions,
aws-marketplace:Subscribe,
aws-marketplace:Unsubscribe,
aws-marketplace:CreatePrivateMarketplaceRequests,
aws-marketplace:ListPrivateMarketplaceRequests,
aws-marketplace:DescribePrivateMarketplaceRequests

 

 

Environment Preparation

Below is the guide to prepare an environment for BryteFlow in AWS:

  1. Create an AWS account: To prepare the environment for BryteFlow in AWS, the user is required to have an AWS account. If you already have an AWS account, skip to the next step. If you don’t have an AWS account, refer to the AWS guide to create one.
  2. Create an IAM user: It is recommended to create a separate user for managing all AWS services; DO NOT use the root user for any task. Refer to the AWS guide to create an IAM admin user.

  3. Create and assign policy to the User: Use the AWS Management Console to create a customer managed policy and then attach that policy to the IAM user as per their role. The policy created allows an IAM test user to sign in directly to the AWS Management Console with assigned permissions.
  4. Signing in to AWS: 
    • As an IAM User: Sign in to AWS management console using your Account_id or Account alias in addition to user name and password. More details  available here.
    • AWS SSO: Sign in using IAM Identity center(AWS SSO) . Details on setting up AWS SSO can be found here.
  5. Create a VPC: A virtual private cloud (VPC) is a virtual network dedicated to your AWS account. It is logically isolated from other virtual networks in the AWS Cloud. You can launch your BryteFlow application and all related AWS resources, such as Amazon EC2 instances, into your VPC. For details on creating a VPC refer to AWS guide.
  6. Creating a Private Subnet in Your VPC: Considering that BryteFlow Ingest needs to be set up in the customer’s VPC, it is recommended to create a new private subnet within the VPC for BryteFlow. Please follow the detailed AWS User Guide for Creating a Subnet.
  7. Creating a Security Group: A security group acts as a virtual firewall for your instance to control inbound and outbound traffic. Security groups act at the instance level, not the subnet level. Therefore, each instance in a subnet in your VPC can be assigned to a different set of security groups. If you don’t specify a particular group at launch time, the instance is automatically assigned to the default security group for the VPC, which is highly not recommended. For each security group, you add rules that control the inbound traffic to instances, and a separate set of rules that control the outbound traffic. For more details, refer to the AWS guide for security groups.
  8. Security Group Rules: You can add or remove rules for a security group which is authorizing or revoking inbound or outbound access. A rule applies either to inbound traffic (ingress) or outbound traffic (egress). You can grant access to a specific CIDR range, or to another security group in your VPC or in a peer VPC (requires a VPC peering connection).

  9. Creating an IAM Role: BryteFlow uses the IAM role assigned to the EC2 where the application is hosted. The EC2 role needs to have all the required policies attached. To create an IAM role for BryteFlow, refer to the AWS guide. Assign the required policies to the newly created IAM role.

  10. Assigning Role to Users or Group: The IAM role needs to be assigned to an AWS Directory Service user or group. The role must have a trust relationship with AWS Directory Service.  Refer to the AWS guide to assign users or groups to an existing IAM role.
  11. Creating an Access Key ID and Secret Access Key: BryteFlow uses an access key id and secret access key to connect to AWS services from an on-premises server. It is recommended to have a set of access keys for the BryteFlow user account. Please follow the steps below from the admin user account to create access keys:
    1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
    2. In the navigation pane, choose Users.
    3. Choose the name of the ‘separate BryteFlow‘ user whose access keys you want to manage, and then choose the Security credentials tab.
    4. In the Access keys section, to create an access key, choose Create access key. Then choose Download .csv file to save the access key ID and secret access key to a CSV file on your computer. Store the file in a secure location. You will not have access to the secret access key again after this dialog box closes. After you download the CSV file, choose Close. When you create an access key, the key pair is active by default, and you can use the pair right away.

For more information on secret keys refer to AWS documentation here.

For security reasons, when using access keys it is recommended to rotate all keys after a certain time, say every 90 days. More details are mentioned in the section Managing Access Keys.

12. Creating an Auto Scaling Group: When BryteFlow Ingest needs to be deployed in an HA environment, it is recommended to have your EC2 along with an Auto Scaling Group.

Please follow the steps here to launch it via the AWS console. When launching an Auto Scaling group via the console, below are the recommended parameters that need to be specified (an AWS CLI sketch follows this list):

    1. When choosing an Amazon Machine Image in step 3, please select BryteFlow Standard or Enterprise Edition AMI option based on your requirement.
    2. In ‘Configure Instance Details’ choose the instance type referring to BryteFlow’s recommendations under the ‘Prerequisite’ section
    3. For ‘Number of Instances’, it is recommended to have a minimum of 2 for an HA type of deployment.
    4. Under ‘Create Launch Configuration’ select the IAM Role as ‘BryteFlowEc2Role’
    5. Add Storage as per the recommendations under ‘Additional AWS Services’
    6. Choose the ‘Security Group’ created for BryteFlow in the previous steps.
    7. Choose a key pair to ‘Create Launch Configuration’
    8. Follow the remaining steps as mentioned here.
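
As a minimal sketch of the equivalent AWS CLI calls (the AMI id, security group id, key pair and subnet ids are hypothetical placeholders; substitute the values created in the previous steps):

# Launch configuration referencing the BryteFlow AMI, instance type, IAM role and security group
aws autoscaling create-launch-configuration --launch-configuration-name BryteFlowLC --image-id <bryteflow_ami_id> --instance-type t2.medium --iam-instance-profile BryteFlowEc2Role --security-groups <security_group_id> --key-name <key_pair_name>

# Auto Scaling group spanning two private subnets with a minimum of 2 instances for HA
aws autoscaling create-auto-scaling-group --auto-scaling-group-name BryteFlowASG --launch-configuration-name BryteFlowLC --min-size 2 --max-size 2 --vpc-zone-identifier "<subnet_id_1>,<subnet_id_2>"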

Creating An EC2 System

Please refer to the AWS documentation on how to create an EC2 instance.
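
As a minimal sketch using the AWS CLI (all ids are hypothetical placeholders; when launching from the Marketplace AMI, the image id is the BryteFlow AMI for your region):

aws ec2 run-instances --image-id <ami_id> --instance-type t2.medium --key-name <key_pair_name> --security-group-ids <security_group_id> --subnet-id <private_subnet_id> --iam-instance-profile Name=BryteFlowEc2Role --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=BryteFlowIngest}]'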

Outbound Connections

BryteFlow connects to any source and destination endpoints outside of its VPC using NAT/VPN or API Gateways.

A NAT gateway is a Network Address Translation (NAT) service. You can use a NAT gateway so that instances in a private subnet can connect to services outside your VPC but external services cannot initiate a connection with those instances. For more details refer to AWS guide

To connect the VPC to remote network for enabling source/destination endpoint connections, use AWS VPN. For more details refer to AWS guide

 

 

Creating S3 Bucket

Please refer to the AWS documentation for creating an S3 bucket.
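
As a minimal sketch using the AWS CLI (the bucket name is a placeholder and ap-southeast-2 is just an example region; blocking public access is shown as a hedged best-practice step, not a BryteFlow requirement):

# Create the data lake bucket in your chosen region
aws s3api create-bucket --bucket <bucket_name> --region ap-southeast-2 --create-bucket-configuration LocationConstraint=ap-southeast-2

# Block public access to the bucket
aws s3api put-public-access-block --bucket <bucket_name> --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true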

Configuring EMR Cluster

Prior to launching an EMR cluster, it is recommended to verify the service limits for EMR within your AWS region.

When using BryteFlow in,

  • a Standard or Hybrid environment, it is recommended to have 3 instances for the EMR cluster (1 master and 2 core nodes)
  • High Availability mode, it is recommended to have 6 instances; the 3 additional instances are for DR mode whenever a failure occurs.

To know more about AWS service limits and how to manage service limits, click on the respective links.

Launch an EMR cluster from the AWS console:

Log in to your AWS account and select the correct AWS region where your S3 bucket and EC2 instance are located.

  1. Click on the services drop down in the header.
  2. Select EMR under Analytics or you can search for EMR.
  3. Click on the ‘Create cluster’ button
  4. In Create Cluster – Quick Options, please type in the Cluster Name (the name you will identify the cluster with).
    Keep the Logging check box selected; the S3 folder will be selected by default. Launch mode should be Cluster.
  5. Under Software configuration select release emr-5.14.0 and in Applications select Core Hadoop: Hadoop 2.8.3 with Ganglia 3.7.2, Hive 2.3.2, Hue 4.1.0, Mahout 0.13.0, Pig 0.17.0, and Tez 0.8.4
  6. Hardware configuration- Please select Instance type and number of Instances you want to run.
  7. Security and access –
    Please select the EC2 key pair that you want to use with the EMR Cluster. This key will be used to SSH into the Cluster. Permission should be set to the ‘BryteFlowEc2Role’ created earlier.
  8. You can add tags to your EMR cluster and configure the tag in Ingest to avoid re-configuring the software in case you plan to terminate the cluster and create a new one. This helps users keep control of their clusters and save cost on AWS resources.
  9. Click on the ‘Create cluster’ button (provisioning of a cluster can take up to 15-20 min).
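
As a minimal sketch of an equivalent cluster launched via the AWS CLI (the log bucket, key pair, subnet and tag values are placeholders; --use-default-roles is an assumption for brevity, and the instance types/counts should follow the recommendations in the ‘Prerequisite’ section):

aws emr create-cluster --name "BryteFlowEMR" --release-label emr-5.14.0 --applications Name=Hadoop Name=Ganglia Name=Hive Name=Hue Name=Mahout Name=Pig Name=Tez --instance-type m4.xlarge --instance-count 3 --use-default-roles --ec2-attributes KeyName=<key_pair_name>,SubnetId=<private_subnet_id> --log-uri s3://<bucket_name>/emr-logs/ --tags Name=BryteFlowEMR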

Additional AWS services

As BryteFlow uses several AWS resources to fulfill user requirements, the cost of these services is separate from the BryteFlow charges and is billed by AWS to your account. If you are using Snowflake as a destination, the cost of the Snowflake data warehouse is also separate from BryteFlow.

The list below provides the other billable services used with BryteFlow. Please use the AWS Pricing Calculator to estimate the AWS cost of additional resources.

A sample estimate for a high availability setup with a source data volume of 100 GB is provided for reference here. Please note that not all services are mandatory; the size and number of services will vary for each customer environment. The sample is for reference purposes only.

Please note: ALL AWS services have service limits; check for sufficient resources before launching the services and, if needed, request a quota increase following AWS guidelines. Please refer to the AWS guide to check the service limit corresponding to each service.

 

Service                                  Mandatory                                   Billing Type     Service Limits
AWS EC2                                  Y                                           Pay-as-you-go    check EC2 quota here
Additional EBS storage attached to EC2   Y                                           Based on size
AWS S3                                   N                                           Pay-as-you-go    check Amazon S3 quota here
AWS EMR                                  N (only required for S3 as a destination)   Pay-as-you-go    check EMR quota here
AWS Redshift                             N                                           Pay-as-you-go    check Amazon Redshift quota here
AWS CloudWatch Logs and metrics          N                                           Pay-as-you-go    check CloudWatch quota here
AWS SNS                                  N                                           Pay-as-you-go    check AWS SNS quota here
AWS DynamoDB (5 WCUs / 5 RCUs)           N                                           Pay-as-you-go    check DynamoDB quota here
Snowflake DW                             N                                           Pay-as-you-go
AWS Lambda                               N                                           Pay-as-you-go    check AWS Lambda quota here
AWS KMS                                  N                                           Pay-as-you-go    check AWS KMS quota here
AWS Athena                               N                                           Pay-as-you-go    check Amazon Athena quota here
AWS Kinesis                              N                                           Pay-as-you-go    check Amazon Kinesis quota here

BryteFlow recommends using the instance types mentioned below for EC2, with EBS volumes attached:

EC2 Instance Type   BryteFlow Standard Edition     BryteFlow Enterprise Edition   Recommended EBS Volume   EBS Volume Type
t2.small            Volume < 100 GB                NA                             50 GB                    General Purpose SSD (gp2)
t2.medium           Volume > 100 GB and < 300 GB   Volume < 100 GB                100 GB                   General Purpose SSD (gp2)
t2.large            Volume > 300 GB and < 1 TB     Volume > 100 GB and < 300 GB   500 GB                   General Purpose SSD (gp2)
m4.large            NA                             Volume > 300 GB and < 1 TB     500 GB                   General Purpose SSD (gp2)

IMDS Settings and Recommendations

BryteFlow uses the latest version of the AWS SDK in every AMI release. It uses IMDSv2 for any API calls to AWS services. It is recommended to disable IMDS post-deployment if required. Please refer to the AWS documentation on how to disable IMDS.
To modify this using the AWS CLI, execute the command below:
aws ec2 modify-instance-metadata-options \
--instance-id <instance id> \
--http-endpoint disabled

Managing Access Keys

BryteFlow uses an access key and secret key to authenticate to AWS services like S3, Redshift, etc. It requires an AWS access key id and AWS secret key for accessing S3 and other services from on-premises. AWS IAM roles are used when using an AMI or an EC2 server.

For security reasons, when using access keys or KMS keys it is recommended to rotate the keys after a certain time, say every 90 days. After new keys are generated, they need to be updated in Ingest’s configuration. Please follow the steps below (an AWS CLI sketch for generating a new key pair follows this list):

  1. Open the Ingest instance that needs to be updated with the new key in the web browser
  2. Go to ‘Schedule’ tab. Stop the replication schedule for BryteFlow Ingest by turning ‘OFF’ the Schedule button.
  3. Go to ‘Connections’-> ‘Destination File System’
  4. Enter the new ‘Access key’ and ‘Secret access key’ in the respective text box and hit ‘Apply’
  5. Once the keys are saved, resume replication by turning ‘ON’ the schedule.
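
As a minimal sketch of generating and retiring an access key pair for the BryteFlow IAM user with the AWS CLI (the user name follows the example used earlier in this guide; the old key id is a placeholder):

# Create a new access key pair for the BryteFlow user
aws iam create-access-key --user-name BryteFlow_User

# After updating Ingest and confirming replication, deactivate and then delete the old key
aws iam update-access-key --user-name BryteFlow_User --access-key-id <old_access_key_id> --status Inactive
aws iam delete-access-key --user-name BryteFlow_User --access-key-id <old_access_key_id>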

Details of key rotation can be found in AWS documentation https://docs.aws.amazon.com/kms/latest/developerguide/rotate-keys.html

The IAM role for ‘BryteFlow’ should have the recommended policies attached. Please refer to the section ‘AWS Identity and Access Management (IAM) for BryteFlow‘ for the list of policies and permissions.

Data Security and Encryption

BryteFlow provides various mechanisms for data security by applying encryption:

  1. With KMS, BryteFlow Ingest uses a customer-specified KMS key to encrypt customer data on AWS S3, Secrets Manager and DynamoDB. Configure the customer KMS key id in BryteFlow; it is used to encrypt data stored on the various AWS services.
  2. With AES-256, BryteFlow Ingest supports server-side encryption. Amazon S3 server-side encryption uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256), to encrypt your data, and this is supported by BryteFlow by default.

BryteFlow Ingest doesn’t store any data outside of the customer’s designated environment. It can store data in the AWS services below, depending on the customer requirements:

  1. Amazon EC2, only for temporary staging and pipeline configuration; refer to the AWS guide for enabling encryption.
  2. EBS storage, only for temporary staging and pipeline configuration; refer to the AWS guide for enabling encryption.
  3. Amazon S3, only when S3 is a destination endpoint; refer to the AWS guide for enabling encryption.
  4. Amazon Redshift, only when Redshift is a destination endpoint; refer to the AWS guide for enabling encryption.
  5. Amazon Athena, only when Athena is a destination endpoint; refer to the AWS guide for enabling encryption.
  6. Amazon Aurora DB, only when S3 is a destination endpoint; refer to the AWS guide for enabling encryption.
  7. Amazon DynamoDB, when configured for High Availability; refer to the AWS guide for enabling encryption.
  8. AWS Secrets Manager, for storing all credentials set up for the pipeline; refer to the AWS guide for details on encryption.

Also, for all non-AWS destination endpoints:

  1. Snowflake – Data encryption on Snowflake
  2. Kafka – BryteFlow uses TLS encryption to load data to Kafka streams. To set up encryption in Kafka streams, refer to the user guide.

As a best practice, BryteFlow recommends enabling encryption on all the services where the data is stored.
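
As a minimal sketch of enabling KMS-based default encryption on the destination S3 bucket with the AWS CLI (the bucket name and KMS key id are placeholders; the same customer KMS key id is the one you would configure in BryteFlow):

aws s3api put-bucket-encryption --bucket <bucket_name> --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"<kms_key_id>"}}]}'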

Key Rotation

BryteFlow recommends rotating all keys configured in Ingest every 90 days for security reasons. This includes all source and destination endpoint credentials. Below are some references for AWS services for more details.

For AWS KMS key rotation refer to the AWS guide.

For AWS Redshift key rotation refer to the AWS Guide.

Follow the recommendations below for all non-AWS sources and destinations:

External Application    Reference for Key Rotation
SAP                     SAP password rotation
Oracle                  Oracle password rotation
MS SQL Server           MS SQL Server password rotation
Salesforce              Salesforce password rotation
MySQL                   MySQL password rotation
PostgreSQL              PostgreSQL password rotation

Configure Data Encryption

BryteFlow adheres to the AWS recommendation of applying encryption to data at rest and in transit. This can be achieved by creating the keys and certificates that are used for encryption.

For more information, refer to AWS documentation on Providing Keys for Encrypting Data at Rest with Amazon EMR and Providing Certificates for Encrypting Data in Transit with Amazon EMR Encryption.

For an Amazon Redshift destination, it is recommended to enable database encryption to protect data at rest. Refer to the AWS guide for more details.

AWS Secrets Manager uses encryption via AWS KMS; for more details refer to the AWS guide.

Specifying Encryption Options Using the AWS Console

Choose options under Encryption according to the following guidelines:

  • Choose options under At rest encryption to encrypt data stored within the file system.

Under S3 data encryption, for Encryption mode, choose a value to determine how Amazon EMR encrypts Amazon S3 data with EMRFS. BryteFlow Ingest supports the encryption mechanisms below.

  • SSE-S3
  • SSE-KMS or CSE-KMS

Encryption in-transit

BryteFlow uses SSL to establish any connection (AWS services, databases, etc.) for data flow, ensuring secure communication in transit.

SSL involves the complexity of managing security certificates, and it is important to keep the certificates active at all times for uninterrupted service.

AWS Certificate Manager handles the complexity of creating and managing public SSL/TLS certificates. Customers can configure notifications before the expiry date approaches and renew in advance, so that the services run uninterrupted. Refer to the AWS guide on managing ACM here.

Storing and Managing Credentials

BryteFlow uses AWS Secrets Manager to store any/all credentials. This includes both source and destination endpoint credentials for databases and APIs. All secrets are encrypted using KMS encryption. BryteFlow creates a secret in AWS Secrets Manager for all credentials, along with the BryteFlow admin user details, and also allows you to modify the secret from the GUI. Go to the respective setup page in the BryteFlow application to update the secret details. It is recommended to rotate all keys stored in Secrets Manager; refer to the AWS guide for the same.
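
As a minimal sketch of reviewing the stored secrets with the AWS CLI (the secret name is a placeholder; the calls rely on the secretsmanager permissions listed in the IAM section above):

# List the secrets in the account
aws secretsmanager list-secrets

# Review a specific secret's metadata, including when it was last changed or rotated
aws secretsmanager describe-secret --secret-id <bryteflow_secret_name>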

Testing The Connections

Verify if the connectivity to remote services is available.

To test the remote connections you need the telnet utility. Telnet has to be enabled from the Control Panel under ‘Turn Windows features on or off’.

  1. Go to Start, then Run, type CMD, and click OK.
  2. Type the following at the command prompt:
telnet <IP address or hostname> <port number>

For example

telnet 192.168.1.1 8081

If the connection is unsuccessful, an error will be shown.
If the command prompt window is blank with only the cursor, then the connection is successful and the service is available.

Connection error to source or destination database server.

In case of any connectivity issue to source or destination database, please check if the BryteFlow server is able to reach the remote host:port.

You can test the connection to the IP address and port using the telnet command.

telnet <IP address or hostname> <port number>

Or you can use the PowerShell Test-NetConnection command to verify the connection.

tnc <IP address or hostname> -Port <port number>

Unable to start windows service

Error: Unable to start Windows service ‘BryteFlow Ingest’

Resolution: If Java is not installed or the system path is not updated, the Ingest service will throw an error on startup. Install Java 1.8 or add the Java path to the system path. To verify, go to CMD and type: java -version

If the response is ‘unable to recognize command’, please check the Java path in the environment variable ‘path’ and update it to the correct path.

Application not able to launch

The BryteFlow Ingest service is installed and started, but the application is not launching in the browser.

Resolution: The BryteFlow application requires Java 1.8 to function. Please install the correct version of Java and restart the service.

If Java 11 is installed, the Ingest service will start up, but the page will display an error message.

To verify the version, go to CMD and type: java -version

Expected result: java 1.8 <any build>

For Example: java version “1.8.0_171”
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

If the Java version is higher, please uninstall Java and install the required version.

Grants not available on the database

Error:  ‘Cannot Open database ‘demo’ requested by the login’

Resolution: The user does not have the grants to connect to the database. Apply the correct grants to the user and try again.

Login failed for user

Error: ‘Login failed for User ‘Demo’

Resolution: The user does not exist or there is a typo in the username or the password is incorrect.

 

MS SQL Server as a source connector

Please follow the recommended steps below to set up your MS SQL Server source connector.

Preparing MS SQL Server

SQL Server setup depends on the replication option chosen: Change Tracking or Change Data Capture. Prerequisites for each option are described in detail; follow the link for details.

 

Security for MS SQL Server

The BryteFlow Ingest database replication login user should have the VIEW CHANGE TRACKING permission to view the Change Tracking information.

--Review change tracking tables (is_track_columns_updated_on: 1 = enabled, 0 = disabled)
SELECT *
  FROM sys.all_objects
 WHERE object_id IN (SELECT object_id 
                       FROM sys.change_tracking_tables
                      WHERE is_track_columns_updated_on = 1);

Verification of MS SQL Server source

To verify whether change tracking is already enabled on the database, run the following SQL query. If a row is returned, Change Tracking has been enabled for the database.

SELECT *
  FROM sys.change_tracking_databases
 WHERE database_id = DB_ID('databasename');

The following SQL will list all the tables for which Change Tracking has been enabled in the selected database.

USE databasename;
SELECT sys.schemas.name as schema_name,
       sys.tables.name as table_name
  FROM sys.change_tracking_tables
  JOIN sys.tables ON sys.tables.object_id = sys.change_tracking_tables.object_id
  JOIN sys.schemas ON sys.schemas.schema_id = sys.tables.schema_id;

Data Types in MS SQL Server

BryteFlow Ingest supports most MS SQL Server data types as a source; see the following list of supported types:

MS SQL Server Data Types

BIGINT, BIT, DECIMAL, INT, MONEY, NUMERIC (p,s), SMALLINT, SMALLMONEY, TINYINT,
REAL, FLOAT, DATETIME, DATETIME2, SMALLDATETIME, DATE, TIME, DATETIMEOFFSET,
CHAR, VARCHAR, VARCHAR (max), NCHAR, NVARCHAR (length), NVARCHAR (max),
BINARY, VARBINARY, VARBINARY (max), TIMESTAMP, UNIQUEIDENTIFIER, HIERARCHYID, XML

Oracle DB as a source connector

Please follow the recommended steps below to set up your Oracle source connector.

Preparing Oracle on Amazon RDS

Enable Change Tracking for a database on Amazon Oracle RDS

  • In Oracle on Amazon RDS, the supplemental logging should be turned on at the database level.
  • Supplemental logging is required so that additional details are logged in the archive logs.
    To turn on supplemental logging at the database level, execute the following queries.

    exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD','ALL');
  • To retain archived redo logs on your DB instance, execute the following command (example: 24 hours):
    exec rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours',24);
  • To turn on supplemental logging at the table level, execute the following statement
    ALTER TABLE <schema>.<tablename> ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

Preparing On-premises Oracle

Enable Change Tracking for an On-Premises Oracle Server

Execute the following queries on Oracle Server to enable change tracking.

  • Oracle database should be in ARCHIVELOG mode.
  • The supplemental logging has to be turned on at the database level. Supplemental logging is required so that additional details are logged in the archive logs.
    To turn on supplemental logging at the database level, execute the following statements:

    ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
  • Alternatively, to turn on minimal database supplemental logging, execute the following statements:
    ALTER DATABASE ADD SUPPLEMENTAL LOG DATA; 
    ALTER DATABASE FORCE LOGGING;
  • In Oracle, ensure that supplemental logging is turned on at the table level. To turn on supplemental logging at the table level, execute the following statement:
    ALTER TABLE <schema>.<tablename> ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

Security for Oracle

The Oracle user running BryteFlow Ingest must have the following security privileges:

SELECT access on all tables to be replicated

The following statement should return records…

SELECT * FROM  V$ARCHIVED_LOG;

If no records are returned, SELECT access on V_$ARCHIVED_LOG should be provided, or check whether the database is in ARCHIVELOG mode.

The following security permissions should be assigned to the user

CREATE SESSION
SELECT access on V_$LOGMNR_CONTENTS
SELECT access on V_$LOGMNR_LOGS
SELECT access on ANY TRANSACTION
SELECT access on DBA_OBJECTS
EXECUTE access on DBMS_LOGMNR

Run the following grant statements for <user> to satisfy the above requirements:

GRANT SELECT ON V_$ARCHIVED_LOG TO <user>;
GRANT SELECT ON V_$LOGMNR_CONTENTS TO <user>;
GRANT EXECUTE ON DBMS_LOGMNR TO <user>;
GRANT SELECT ON V_$LOGMNR_LOGS TO <user>;
GRANT SELECT ANY TRANSACTION TO <user>;
GRANT SELECT ON DBA_OBJECTS TO <user>;

 

Verification of Oracle source

To verify that Oracle is set up correctly for change detection, execute the following queries.

Condition to be checked, SQL to be executed, and expected result:

  1. Is ARCHIVELOG mode enabled?
     SELECT log_mode FROM V$DATABASE;
     Expected result: ARCHIVELOG
  2. Is supplemental logging turned on at the database level?
     SELECT supplemental_log_data_min FROM V$DATABASE;
     Expected result: YES
  3. Is supplemental logging turned on at the table level?
     SELECT log_group_name, table_name, always, log_group_type FROM dba_log_groups;
     Expected result: <log group name>, <table name>, ALWAYS, ALL COLUMN LOGGING

Data Types in Oracle

BryteFlow Ingest supports most Oracle data types as a source; see the following list of supported types:

Oracle Data Types

BINARY_DOUBLE, BINARY_FLOAT, CHAR, DATE, INTERVAL DAY TO SECOND, LONG, LONG RAW,
NCHAR, NUMBER, NVARCHAR, RAW, REF, TIMESTAMP, TIMESTAMP WITH LOCAL TIME ZONE, VARCHAR2

 

Preparing On-premises MySQL

To prepare MySQL for change tracking perform the following steps.

To enable binary logging, the following parameters need to be configured as shown below in the my.ini file (MySQL on Windows) or the my.cnf file (MySQL on UNIX/Linux):

Parameter and value:

  • server_id: Any value from 1, e.g. server_id = 1
  • log_bin: Path to the binary log file, e.g. log_bin = D:\MySQLLogs\BinLog
  • binlog_format: binlog_format=row
  • expire_logs_days: To avoid disk space issues, it is strongly recommended not to use the default value (0), e.g. expire_logs_days = 4
  • binlog_checksum: binlog_checksum=none (BryteFlow also supports CRC32)
  • binlog_row_image: binlog_row_image=full

Preparing MySQL on Amazon RDS

Enabling Change tracking on MySQL on Amazon RDS

To enable change tracking for MySQL on Amazon RDS, perform the following steps.

  1. In the AWS Management Console, create a new DB parameter group for MySQL on Amazon RDS and configure the parameters shown below.
  2. The MySQL RDS DB instance should use the newly created DB parameter group so that binary logging is enabled.
binlog_format: binlog_format=row
binlog_checksum: binlog_checksum=none OR CRC32

Security for MySQL

The Ingest user id must have the following privileges:

  1. REPLICATION CLIENT and REPLICATION SLAVE.
  2. Select privileges on the source tables designated for replication.
  3. Execute the following queries to grant permissions to a MySQL user.
CREATE USER 'bflow_ingest_user' IDENTIFIED BY '*****';
GRANT SELECT, REPLICATION CLIENT, SHOW DATABASES ON *.* TO bflow_ingest_user;
GRANT SELECT, REPLICATION SLAVE, SHOW DATABASES ON *.* TO bflow_ingest_user;

Note: If the source DB type is Amazon RDS MySQL, please download 'mysqlbinlog.exe' and add its directory to the Windows 'PATH' environment variable on the client machine (the BryteFlow server).

Preparing MariaDB on Amazon RDS

To enable change tracking for MariaDB on Amazon RDS, perform the following steps.

  1. In the AWS Management Console, create a new DB parameter group for MariaDB on Amazon RDS and configure the parameters shown below.
  2. The MariaDB RDS DB instance should use the newly created DB parameter group so that binary logging is enabled.
    binlog_format: binlog_format=row
    binlog_checksum: binlog_checksum=none OR CRC32

Preparing PostgreSQL DB

  1. Use a PostgreSQL database that is version 9.4.x or later
  2. The IP address of the BryteFlow machine must be added to the pg_hba.conf
    configuration file with the “replication” keyword in the database field.
    Example:
    host replication all 189.452.1.212/32 trust
  3. Set the following parameters and values in the postgresql.conf configuration file:
     Set wal_level = logical
     Set max_replication_slots to a value greater than 1.

    The max_replication_slots value should be set according to the number of tasks that you want to run. For example, to run four tasks you need to set a minimum of four slots. Slots open automatically as soon as a task starts and remain open even when the task is no longer running. You need to manually delete open slots.

    Set max_wal_senders to a value greater than 1.

    The max_wal_senders parameter sets the number of concurrent tasks that can run.

    Set wal_sender_timeout = 0

    The wal_sender_timeout parameter terminates replication connections that are inactive longer than the specified number of milliseconds. Although the default is 60 seconds, we recommend that you set this parameter to zero, which disables the timeout mechanism.

    Note: After changing these parameters, PostgreSQL must be restarted.

  4. Grant superuser permissions for the user account specified for the PostgreSQL source database. Superuser permissions are needed to access replication-specific functions in the source.
 

Preparing Salesforce account for BryteFlow Ingest

On Salesforce, Change Data Capture is turned on by default; please do not turn it off.

You will need to generate a security token to be used with BryteFlow Ingest.
A security token is a case-sensitive alphanumeric key that is appended to your Salesforce password.

e.g. Your Salesforce password to be used with Ingest will be "<your Salesforce password><security_token>"

A token can be generated by following these steps:
1. Log in to your Salesforce account and go to My Settings > Personal > Reset My Security Token.
2. Click on the Reset Security Token button. The token will be emailed to the email account associated with your Salesforce account.

Preparing for BryteFlow Trigger solution

For databases such as DB2 or Firebird, or for any RDBMS where there is no access to archive logs for change data, BryteFlow provides a trigger option to capture the change data.

For this solution there are certain prerequisites which need to be implemented:

  1. The BryteFlow replication user should have 'select' access on the tables to be replicated.
  2. The BryteFlow replication user should have access to 'create triggers' on the tables to be replicated.
  3. The BryteFlow replication user should have access to 'create tables' on the source database.

Please provide relevant grants to BryteFlow replication user in order to proceed with the Trigger Solution.

Prerequisites for SAP HANA (Change Tracking):

1. Create a user account for BryteFlow.
CREATE USER <USERNAME> PASSWORD <PASSWORD>;

2. The BryteFlow replication user should have 'select' access on the tables to be replicated and access to 'create triggers' on them.

Grant the below privileges to the BryteFlow user created above.
GRANT SELECT, TRIGGER ON SCHEMA <YOURSCHEMA> TO <USERNAME>;

3. The BryteFlow replication user should have access to a schema where it can create a table on the source database.
This is used to store transactions for restart and recoverability.

Grant the below privileges to the BryteFlow user created above.
GRANT CREATE ANY ON SCHEMA <YOURSCHEMA> TO <USERNAME>;

Starting & Stopping BryteFlow Ingest

If you are using the AMI from AWS Marketplace, BryteFlow Ingest will be preinstalled as a service in Windows.

Alternatively, you can install the service by executing the following command using the Command Prompt (Admin).

  1. Navigate to the directory of the installation.
  2. service.exe --WinRun4J:RegisterService

To Start BryteFlow Ingest

  1. Start the BryteFlow Ingest service using Windows Services or  Windows Task Manager
  2. Type the URL in the Chrome browser
localhost:8081

To Stop Bryteflow Ingest

  1. Stop the BryteFlow Ingest service
  2. Replication processes can also be aborted immediately by going to Task Manager
    -> Processes -> service.exe – and selecting “End Task”

Configuration of BryteFlow Ingest

The configuration of BryteFlow Ingest is performed through the web console.

  1. Type the URL in the Chrome browser
localhost:8081

The screen will then present the following tabs (left side of the screen)

  • Dashboard
  • Connections
  • Data
  • Schedule
  • Configuration
  • Log

Configure Source Databases using the API below:

POST http://host:port/ingest/api/ingest?cmd=conn&conn=s

Body:

func=save&src-db=<database name>&src-host=<database host>&src-options=&src-port=<database port>&src-pwd=<database password>&src-pwd2=<database password>&src-type=rds.oracle11lm&src-uid=<database user id>&type=src

Configure Destination Databases using the API below:

POST http://host:port/ingest/api/ingest?cmd=conn&conn=d

Body:

dst-bucket=<S3 bucket>&dst-db=<database name>&dst-dir=<S3 work directory>&dst-host=<Redshift host>&dst-iam_role=<IAM Role>&dst-options=&dst-port=<Redshift port>&dst-pwd=<Redshift password>&dst-pwd2=<Redshift password>&dst-type=rds.redmulti&dst-uid=<Redshift user id>&func=save&type=dst
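
For illustration, the two API calls above can be scripted. The sketch below assumes the Python 'requests' package and uses placeholder host, port and connection values; the endpoint, query parameters and body fields are the ones documented above.

import requests

BASE = "http://localhost:8081/ingest/api/ingest"  # placeholder host and port

# Configure the source database (cmd=conn&conn=s) with the src-* fields shown above.
src_body = {
    "func": "save",
    "src-db": "ORCL",          # placeholder database name
    "src-host": "10.0.0.5",    # placeholder database host
    "src-options": "",
    "src-port": "1521",        # placeholder database port
    "src-pwd": "secret",
    "src-pwd2": "secret",
    "src-type": "rds.oracle11lm",
    "src-uid": "bflow_user",   # placeholder database user id
    "type": "src",
}
resp = requests.post(BASE, params={"cmd": "conn", "conn": "s"}, data=src_body)
print(resp.status_code, resp.text)

# The destination database is configured the same way (cmd=conn&conn=d),
# passing the dst-* fields listed above in the data dictionary.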

Dashboard

The dashboard provides a central screen where the overall status of this instance of BryteFlow Ingest can be monitored.

  • The Data Sources Transfer Summary shows the number of records transferred. When hourly is selected you can view the transfer statistics for 24 hours; if daily is selected, the monthly statistics are displayed.
    • The pie chart displays the status of the process
      • Extraction, denoted by red
      • Loading, denoted by orange
      • Loaded, denoted by green
    • Hovering on the bar graph gives the exact number of records transferred.
  • Schedule Extract Status displays the schedule status.
  • The Configure icon will take you to the configuration of the source tables, specifically the table, type of transfer, table primary key(s) and the selection of masked columns.
  • The Dashboard provides quick access for configuration of BryteFlow Ingest (Source, Destination Database, Destination File System and Email Notification)

 

Connections

The connections tab provides access to the following sub-tabs:

  • Source Database
  • Destination Database
  • Credentials
  • Email Notification

Source Database

Configuration of MS SQL Server, Oracle, SAP (MS SQL Server), SAP (Oracle), MySQL, Salesforce, PostgreSQL, S/4 HANA, SAP ECC and others as a source database.

MS SQL Server DB

  1. In the Database Type select “Microsoft SQL Server Change Tracking” from the drop-down list.
  2. In the Database Host field please enter the IP address or hostname of the database server
  3. In the Database Port field please enter the port number on which the database server is listening. The default port for MS SQL Server is 1433
  4. In the Database Name field please enter the name of your database e.g. BryteMSSQL
  5. Enter a valid MS SQL Server database user Id that will be used with BryteFlow Ingest. If a Windows user is required, please contact BryteFlow support info@bryteflow.com to understand how to configure this
  6. Enter Password; then confirm it by re-entering in Confirm Password
    • Please note, passwords are encrypted within BryteFlow Ingest
  7. JDBC options are optional and can be used to extend the JDBC URL used to access the database.
  8. Click on the ‘Test Connection’ button to test connectivity
  9. Click on the ‘Apply’ button to confirm and save the details

Available source connectors for Oracle database

  1. Oracle Logminer: Available for all versions of Oracle database. Extracts changed data from Oracle database archived logs only.
  2. Oracle Logminer (Pluggable DB): Available for all versions of Oracle Pluggable databases. Extracts changed data from Oracle database archived logs only.
  3. Oracle Remote Logminer: Available for all versions of Oracle database. Extracts changed data from a remote Oracle database for archived logs only.
  4. Oracle Remote Logminer  (Pluggable DB): Available for all versions of Oracle Pluggable databases. Extracts changed data from a remote Oracle database for archived logs only.
  5. Oracle Continuous Logminer: Available for Oracle database versions 18c and below. Extracts changed data from Oracle database Redo logs as well as Archived logs.
  6. Oracle Continuous Logminer (Pluggable DB): Available for Oracle Pluggable database versions 18c and below. Extracts changed data from Oracle database Redo logs as well as Archived logs.
  7. Oracle 19c Continuous Logminer: Available for Oracle 19c database. Extracts changed data from Oracle database Redo logs as well as Archived logs.
  8. Oracle 19c Continuous Logminer (Pluggable DB): Available for Oracle 19c Pluggable database. Extracts changed data from Oracle database Redo logs as well as Archived logs.
  9.  Oracle Fast Logminer : Available for all versions of Oracle database. Allows multi-threaded extraction of changed data from Oracle database archived logs only.
  10. Oracle RAC : Available for all Oracle database versions. Extracts changed data from Oracle RAC archive logs, it uses Oracle logmining.
  11. Oracle RAC (Pluggable DB): Available for all Oracle Pluggable database versions. Extracts changed data from Oracle RAC archive logs, it uses Oracle logmining.
  12. Oracle (Full Extracts): Available for all versions of Oracle database. It performs ‘Full Refresh’ of data in every delta load.
  13. Oracle (Timestamps): Available for all versions of Oracle database. It gets incremental data based on the timestamp columns.

Oracle DB

  1. In the database type select the preferred connector from the drop-down list.
  2. In the database host field please enter the IP address or hostname of the database server.
  3. In the Database Port field please enter the port number on which the database server is listening. The default port for Oracle is 1521
  4. In the database name field please enter Oracle SID
  5. Enter a valid Oracle database user id that will be used with Bryteflow Ingest.
  6. Enter Password; then confirm it by re-entering in Confirm Password
    • Please note, passwords are encrypted within BryteFlow Ingest
  7. JDBC options are optional and can be used to extend the JDBC URL used to access the database.
  8. Click on the ‘Test Connection’ button to test connectivity
  9. Click on the ‘Apply’ button to confirm and save the details

Please note: When using SID to connect to a dedicated Oracle server instance use ‘:SID’  in the Database Name of source configuration.

 

Oracle Pluggable DB:

  1. In the database type select the appropriate option, 'Oracle Log Miner (Pluggable DB)' or 'Oracle Continuous LogMiner (Pluggable DB)', from the drop-down list.
  2. In the database host field please enter the IP address or hostname of the PDB container.
  3. In the Database Port field please enter the port number on which the pluggable database is listening. The default port for Oracle is 1521
  4. In the database name field please enter Oracle SID
  5. Enter a valid Oracle database user id for the pluggable db that will be used with Bryteflow Ingest.
  6. Enter Password; then confirm it by re-entering in Confirm Password
    • Please note, passwords are encrypted within BryteFlow Ingest
  7. Enter the root container database name in ‘Remote DB Name’
  8. Enter root container user id in the ‘Remote User ID’
  9. Enter Password for the root user in ‘Remote Password’ ; then confirm it by re-entering in Confirm Password
    • Please note, passwords are encrypted within BryteFlow Ingest
  10. JDBC options are optional and can be used to extend the JDBC URL used to access the database.
  11. Click on the ‘Test Connection’ button to test connectivity
  12. Click on the ‘Apply’ button to confirm and save the details

Please note: When using SID to connect to a dedicated Oracle server instance use ‘:SID’  in the Database Name of source configuration.

 

 

 

MySQL

  1. In the Database Type select "MySQL 5.1 or later" from the drop-down list.
  2. In the Database Host field please enter the IP address or hostname of the database server.
  3. In the Database Port field please enter the port number on which the database server is listening. The default port for MySQL Server is 3306.
  4. In the Database Name field please enter the name of your database e.g. BryteMySQL.
  5. Enter a valid MySQL database user Id that will be used with BryteFlow Ingest.
  6. Enter Password; then confirm it by re-entering in Confirm Password.
    • Please note, passwords are encrypted within BryteFlow Ingest.
  7. JDBC options are optional and can be used to extend the JDBC URL used to access the database.
  8. Click on the ‘Apply’ button to confirm and save the details.
  9. Click on the ‘Test Connection’ button to test connectivity.

Salesforce

  1. In the Database Type select “Salesforce” from the drop-down list.
  2. In the Database Name field please enter “login”.
  3. Enter a valid Salesforce user Id that will be used with BryteFlow Ingest.
  4. In Password field enter Salesforce password appended with the security token; then confirm it by re-entering in Confirm Password.
    • Please note, passwords are encrypted within BryteFlow Ingest.
  5. Click on the ‘Test Connection’ button to test connectivity.
  6. Click on the ‘Apply’ button to confirm and save the details.

SAP HANA DB (All Options)

 

  1. In the Database Type for SAP HANA (Change Tracking), SAP HANA(Full Extract), SAP HANA(Timestamps), choose the desired option from the drop-down list.
  2. In the Source Driver field please enter the SAP HANA jdbc driver as “com.sap.db.jdbc.Driver”.
  3. Enter a valid jdbc URL for the host database in the format:  jdbc:sap://<hostname>:<port no.>
  4. In the ‘User Id’ field enter the database username.
  5. In Password field enter database password, then confirm it by re-entering in Confirm Password.
    • Please note, passwords are encrypted within BryteFlow Ingest.
  6. Click on the ‘Test Connection’ button to test connectivity.
  7. Click on the ‘Apply’ button to confirm and save the details.

JDBC Full Extracts / Any Database (Full Extracts)

This connector allows any RDBMS database to be used as a source connector for a Full Extract and Load. It uses JDBC drivers to connect, extracts the initial data, and takes it over to any BryteFlow supported destination.

Please note: incremental CDC via logs is not supported for this driver.

  1. In the database type select ‘JDBC Full Extracts or Any Database (Full Extracts)’
  2. In the source driver provide the JDBC driver name for the source database, e.g. for SAP HANA DB: 'com.sap.db.jdbc.Driver'; for Snowflake: 'com.snowflake.client.jdbc.SnowflakeDriver'
  3. In the Source URL field please enter the JDBC connection string for the source database in the format:  jdbc:<rdbms>://<hostname>:<port no.>
  4. Enter a valid database user id that will be used with Bryteflow Ingest.
  5. Enter Password; then confirm it by re-entering in Confirm Password
    • Please note, passwords are encrypted within BryteFlow Ingest
  6. Click on the ‘Test Connection’ button to test connectivity
  7. Click on the ‘Apply’ button to confirm and save the details.

Destination Database

Available Destinations for AWS Cloud are:

  • S3 files using EMR
  • S3 files using EMR + Load to Redshift
  • S3 files using EMR + Load to Athena
  • Load to Redshift direct
  • Load to Snowflake direct
  • Load to Snowflake Multiload
  • Load to Snowflake using S3
  • Load S3 Deltas
  • Load Kafka Deltas
  • PostgreSQL
  • Load to Databricks using S3
  • Oracle

S3 files using EMR

  1. Enter Database Type: To use Amazon S3 as the destination, please use “S3 Files using EMR” from the drop-down list
  2. Enter S3 bucket name
    • eg brytetest
  3. Enter Data Directory name
    • eg data
  4. Enter working directory name under “Delta Directory”
    • eg delta
  5. Enter EMR instance id:
    • eg j-EMRINSID123 or EMR tag name like tag=brytetest
  6. Click on the ‘Test Connection’ button to test the connection details
  7. Click on the ‘Apply’ button to confirm and save the details

S3 files using EMR + Load to Redshift

  1. Enter Database Type: To use Amazon S3 and Amazon Redshift as your destination, select "S3 files using EMR + Load to Redshift" from the drop-down list.
  2. Enter S3 bucket name
    • eg brytetest
  3. Enter Data Directory name
    • eg data
  4. Enter working directory name under "Delta Directory"
    • eg delta
  5. Enter EMR instance id:
    • eg j-EMRINSID123 or EMR tag name like tag=brytetest
  6. Enter Database Host: Enter the endpoint for Amazon Redshift (excluding port)
    • eg. bryte-dc1.hdyesjdsdf.us-west-2.redshift.amazonaws.com
  7. Enter Database Port: the Redshift default port is 5439
  8. Enter Database Name
    • eg dev
  9. Enter User Id: This is the Redshift user id that will load the schemas, tables, and data automatically to Redshift:
    • eg redshift_user
  10. Enter Password; re-enter to confirm
    • Please note, passwords are encrypted within BryteFlow Ingest
  11. When connecting to Redshift using an IAM Role, enter the 'IAM Role' and apply.
  12. JDBC options are optional and can be used to extend the JDBC URL used to access the database.
  13. Click on the 'Test Connection' button to test the connection details
  14. Click on the 'Apply' button to confirm and save the details

 

S3 files using EMR + Load to Athena

  1. Enter Database Type: To use Amazon S3 and Amazon Athena as your destination, select "S3 files using EMR + Load to Athena" from the drop-down list.
  2. Enter Database Name: provide the database name for Amazon Athena
  3. Enter S3 bucket name
    • eg brytetest
  4. Enter Data Directory name
    • eg data
  5. Enter working directory name under "Delta Directory"
    • eg delta
  6. Enter EMR instance id:
    • eg j-EMRINSID123 or EMR tag name like tag=brytetest
  7. Click on the 'Test Connection' button to test the connection details
  8. Click on the 'Apply' button to confirm and save the details

Load to Redshift direct

  1. Enter 'Database Type' as 'Load to Redshift Direct'
  2. Enter S3 bucket name
    • eg brytetest
  3. Enter working directory name under "Delta Directory"
    • eg delta
  4. Enter Database Host: Enter the endpoint for Amazon Redshift (excluding port)
    • eg. bryte-dc1.hdyesjdsdf.us-west-2.redshift.amazonaws.com
  5. Enter Database Port: the Redshift default port is 5439
  6. Enter Database Name
    • eg dev
  7. Enter User Id: This is the Redshift user id that will load the schemas, tables, and data automatically to Redshift:
    • eg redshift_user
  8. Enter Password; re-enter to confirm
    • Please note, passwords are encrypted within BryteFlow Ingest
  9. When connecting to Redshift using an IAM Role, enter the 'IAM Role' and apply.
  10. Click on the 'Test Connection' button to test the connection details
  11. Click on the 'Apply' button to confirm and save the details
  12. JDBC options are optional and can be used to extend the JDBC URL used to access the database.

Load to Snowflake direct

  1. Enter ‘Database Type’ as ‘Load to Snowflake Direct’
  2. Database host is the Snowflake account URL, e.g. abc123.ap-southeast-2.snowflakecomputing.com
  3. Provide Snowflake Account name under ‘Account name’
  4. Enter Snowflake Data warehouse name in ‘Warehouse Name’ field
  5. Enter Snowflake Database name in ‘Database Name’ field
  6. Enter Snowflake UserID in the Userid field
  7. Password to be configured under Password and Confirm Password section.
  8. JDBC options are optional and can be used to extend the JDBC URL used to access the database.

Load to Snowflake using S3

  1. Enter ‘Database Type’ as ‘Load to Snowflake using S3’
  2. Enter S3 bucket name
    • eg brytetest
  3. Enter working directory name under “Delta Directory”
    • eg delta
  4. Database host is the Snowflake account URL, e.g. abc123.ap-southeast-2.snowflakecomputing.com
  5. Provide Snowflake Account name under ‘Account name’
  6. Enter Snowflake Data warehouse name in ‘Warehouse Name’ field
  7. Enter Snowflake Database name in ‘Database Name’ field
  8. Enter Snowflake UserID in the Userid field
  9. Password to be configured under Password and Confirm Password section.
  10. JDBC options are optional and can be used to extend the JDBC URL used to access the database.
  11. Click on the ‘Test Connection’ button to test the connection details
  12. Click on the ‘Apply’ button to confirm and save the details

 

 

 

Load Kafka Deltas

BryteFlow Ingest supports Apache Kafka as a destination. It integrates change data into the configured Kafka location.

The incremental data is loaded as a message to Kafka Topics.

Kafka Message Keys and Partitioning are supported in BryteFlow. The Kafka messages contain a 'key' in each JSON message, and messages can be put into partitions for parallel consumption.

BryteFlow Ingest puts messages into Kafka Topic in JSON format by default.

The minimum size of a Kafka message sent is 4096 bytes.

Prerequisites for Kafka as a target:

  1. Kafka Host URL or Kafka Broker
  2. Kafka Client
  3. Kafka Topic Name
  4. Kafka ConsumerGroup

 

BryteFlow allows connection to SSL enabled Kafka.

Below are the required fields for Kafka SSL configuration:

  1. kafka.ssl.truststore.location : The full path to the truststore JKS file.
  2. kafka.ssl.truststore.username: The username to the truststore.
  3. kafka.ssl.truststore.password: The password to the truststore.

BryteFlow allows connection to Kafka using Kerberos authentication.

Below are the required fields for Kafka Kerberos authentication:

  1. kafka.sasl.kerberos.service.name : The Kerberos service name value.
  2. kafka.ssl.truststore.location : The full path to the truststore JKS file.
  3. kafka.ssl.truststore.password: The password to the truststore.
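
As a consumption-side illustration only (BryteFlow produces the messages; this sketch is not part of the product), the following Python snippet uses the kafka-python client to read the JSON delta messages from a topic. The broker address, topic name and consumer group are hypothetical placeholders.

import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "bryteflow_deltas",                            # placeholder topic name
    bootstrap_servers="broker.example.com:9092",   # placeholder Kafka broker
    group_id="bryteflow-consumers",                # placeholder consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Each message carries a key and a JSON payload describing the change.
for message in consumer:
    print(message.partition, message.key, message.value)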

Load to Databricks using Amazon S3

Databricks is a unified cloud-based platform that lends itself to a variety of data goals, including data science, machine learning, and analytics, as well as data engineering, reporting, and business intelligence (BI). Because a single system can handle both affordable data storage, as expected of a data lake, and analytical capabilities, as expected of a data warehouse, the Databricks Lakehouse is a much-in-demand platform that makes data access simpler, faster, and more affordable.

BryteFlow supports Databricks on AWS as a destination connector.

To configure Databricks in BryteFlow, please follow the steps below:

    1. Bucket Name: S3 bucket name where data will be loaded.
    2. Delta directory: Directory structure where staging data gets loaded.
    3. Databricks JDBC URL: Configure the JDBC URL for your Databricks warehouse.
    4. Password: The password is the Databricks access token (PWD).
    5. Database Name: This is the metastore name: 'hive_metastore'.

Steps to generate JDBC URL for Databricks:

    1. Login to Databricks account
    2. Go to Data Science & Engineering
    3. Click on Compute
    4. Click on cluster (Which will be used)
    5. Click on Advanced options
    6. Goto JDBC/ODBC tab
    7. Copy JDBC URL For Eg: (jdbc:databricks://dbc-.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/testabcd)

Steps to generate access tokens:

    1. Login to Databricks account.
    2. Click on the user setting.
    3. Go to the Access tokens tab
    4. Click on Generate token
    5. Apply the token in ‘Password’ and ‘Confirm Password’ fields on the connection page.

Destinations for Microsoft Azure cloud

 

Available Destinations for Microsoft Azure cloud are:

  • Microsoft SQL Server databases
  • Azure SQL DB
  • Azure Synapse SQL
  • ADLS2 (Azure Data Lake Services Generation 2)
  • Snowflake on Azure

 

Microsoft SQL Server Databases

  1.  Select ‘Database Type’ as ‘Microsoft SQL Server’ from the drop-down list.
  2.  In the Database Host field please enter the IP address or hostname of the database server
  3.  In the Database Port field please enter the port number on which the database server is listening. The default port for MS SQL Server is 1433
  4.  In the Database Name field please enter the name of your database e.g. BryteMSSQL
  5.  Enter a valid MS SQL Server database user Id that will be used with BryteFlow Ingest. If a Windows user is required, please contact BryteFlow support info@bryteflow.com to understand how to configure this
  6.  Enter Password; then confirm it by re-entering in Confirm Password
  7.  Please note, passwords are encrypted within BryteFlow Ingest
  8.  Click on the ‘Test Connection’ button to test connectivity
  9.  Click on the ‘Apply’ button to confirm and save the details.

 

 

Azure SQL DB

  1.  Select ‘Database Type’ as ‘Azure SQL DB’ from the drop-down list.
  2.  In the Database Host field please enter the IP address or hostname of the database server
  3.  In the Database Port field please enter the port number on which the database server is listening. The default port for SQL database is 1433
  4.  In the Database Name field please enter the name of your database e.g. mysampleDB
  5.  Enter a valid Azure SQL DB database user Id that will be used with BryteFlow Ingest. If a Windows user is required, please contact BryteFlow support info@bryteflow.com to understand how to configure this
  6.  Enter Password; then confirm it by re-entering in Confirm Password
  7.  Please note, passwords are encrypted within BryteFlow Ingest
  8.  Click on the ‘Test Connection’ button to test connectivity
  9.  Click on the ‘Apply’ button to confirm and save the details.

 

Azure Synapse SQL

  1.  Select ‘Database Type’ as ‘Azure Synapse SQL‘ from the drop-down list.
  2.  In the Database Host field please enter the IP address or hostname of the database server
  3.  In the Database Port field please enter the port number on which the database server is listening.
  4.  In the Database Name field please enter the name of your database e.g. mysampleazuredb
  5.  Enter a valid database user Id that will be used with BryteFlow Ingest. If a Windows user is required, please contact BryteFlow support info@bryteflow.com to understand how to configure this
  6.  Enter Password; then confirm it by re-entering in Confirm Password
  7.  Please note, passwords are encrypted within BryteFlow Ingest
  8.  Click on the ‘Test Connection’ button to test connectivity.
  9.  Click on the ‘Apply’ button to confirm and save the details.

 

 

Azure ADLS2

  1. Select ‘Database Type’ as ‘Azure ADLS2‘ from the drop-down list.
  2. Enter the Azure storage account name in ‘Account Name‘ field.
  3. Enter the Azure storage account container name in the ‘Container Name‘ field.
  4. Enter the Access key for the Storage account in the ‘Account Key’ field.
  5.  Please note, access keys are encrypted within BryteFlow Ingest
  6.  Click on the ‘Test Connection’ button to test connectivity.
  7.  Click on the ‘Apply’ button to confirm and save the details.

 

Load to Databricks using ADLS

BryteFlow supports Databricks on Azure as a destination connector.

To configure Databricks in BryteFlow, please follow the description below:

  1. DBFS Mount Point: The mount point field value should be added here as shown below.

The DBFS mount point can be created by creating a notebook on Azure Databricks with the following sample code and running it:

Please update the AZ URL according to the setup.

dbutils.fs.mount(
    source = 'wasbs://demo@demoadls2.blob.core.windows.net',
    mount_point = '/mnt/blobstorage',
    extra_configs = {'fs.azure.sas.demo.demoadls2.blob.core.windows.net': '?sv=2022-11-02&ss=bfqt&srt=co&sp=rwdlacupyx&se=2024-07-25T20:02:42Z&st=2023-07-25T12:02:42Z&spr=https&sig=IPDudzCistFlSkKSb1t2KGneuCmEV7IQTQJwxZroRBo%3D'}
)
dbutils.fs.ls('/mnt/blobstorage')

2. JDBC URL: Can be obtained from Azure Databricks as per the steps mentioned below.

  1. Login to Databricks account
  2. Go to Data Science & Engineering
  3. Click on Compute
  4. Click on cluster (Which will be used)
  5. Click on Advanced options
  6. Goto JDBC/ODBC tab
  7. Copy JDBC URL For Eg: (jdbc:databricks://dbc-.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/testabcd)

3. Password: Please enter the Databricks personal access token as the password.

4. Database Name: Please enter the Databricks DB name, usually HIVE_METASTORE.

5. Container Name: Please enter the ADLS container name, used as a spool directory to load dat files.

6. Account Name: Please enter the ADLS account name.

7. Account Key: Please enter the ADLS account key.

8. Data Directory: Please enter the ADLS data directory path.

Destinations for Google Cloud Platform

Google BigQuery is the available destination for Google Cloud Platform.

Destination File System

Please note: This section does not apply to BryteFlow Ingest v3.10 and onward. The user interface has been redesigned to give a more logical representation; this section is retained only as a reference for previous versions.

To Configure S3 as the file system perform the following steps.

  • Select File System as “AWS S3 with EMR” from the drop-down.
  • In the bucket name field, enter the bucket name that you have created on Amazon S3.
  • In the Delta Directory and Data Directory field, type in the name of the folders on Amazon S3
  • Enter the Amazon EMR instance ID eg. j-1ARB3SOSWXZUZ
  • EMR instance can be specified by Instance ID (as before) or a tag ‘value’ for the tag ‘BryteflowIngest’ or a tag and value expressed as ‘tag=value’. If more than one instance fits the criteria, the first one in the list will be picked. For direct loads to Snowflake or Redshift, enter “none”.
  • In AWS Region select the correct region from the drop-down list.
  • Enter AWS access key id and AWS secret key for accessing the S3 service if installation is on-premises, else IAM roles will be used.
    • Please note, keys are encrypted within BryteFlow Ingest
  • If you are using the KMS enter the KMS key
    • Please note, keys are encrypted within BryteFlow Ingest
  • Click on the ‘Test Directory’ button to test connectivity
  • Click on the ‘Apply’ button to confirm and save the details

Credentials

Please Note: This section is available in BryteFlow Ingest v3.10 and onward. 

BryteFlow Ingest can access AWS services using IAM roles, or via access keys when used on-premises. The access method and credentials need to be configured in Ingest.

There are two options:

AWS Credentials

When accessing AWS services from a BryteFlow server that is on-premises, please select 'AWS Credentials' in the File system type and provide the information below:

  1. In AWS Region select the correct region from the drop-down list.
  2. Enter AWS access key id and AWS secret key for accessing the S3 service if installation is on-premises, else IAM roles will be used.
    • Please note, keys are encrypted within BryteFlow Ingest
  3. If you are using the KMS enter the KMS key
    • Please note, keys are encrypted within BryteFlow Ingest
  4. Click on the ‘Test Directory’ button to test connectivity
  5. Click on the ‘Apply’ button to confirm and save the details

 

AWS IAM Access

Configure IAM roles in BryteFlow Ingest to access AWS services.

  1. Select ‘AWS IAM Access’ from the dropdown
  2. In AWS Region select the correct region from the drop-down list.
  3. Enter the KMS Id
  4. Click on the ‘Test Directory’ button to test connectivity
  5. Click on the ‘Apply’ button to confirm and save the details

 

Email Notification

To configure email updates to be sent, perform the following steps.

  • Choose Mail Type: SMTP using TLS from the drop-down
  • In the Email Host field, type in the address of your SMTP server.
  • In the Email Port field, type in the port number on which the SMTP server is listening.
  • In the User Id field, type the complete email address which will authenticate with the SMTP server.
  • Enter Password for the email; confirm.
    • Please note, passwords are encrypted within BryteFlow Ingest
  • In Send From, enter the email id from which the email will be sent; it has to be a valid email address on the server.
  • In Send To, enter the email address to which the notifications are sent.
  • Click on Test Connection and then Apply to test the connection and save the settings.

Data

NOTE: Please review this section in conjunction with Appendix: Understanding Extraction Process

To select the tables for transfer to the destination database on Amazon Redshift and/or the Amazon S3 bucket, perform the following steps.

  1. Expand the Database.
  2. Browse to the table you want to be synced with Amazon Redshift or Amazon S3.
  3. Select the checkbox next to the table and then click on the table.
  4. On the right-hand side pane, select the type of transfer for the table, i.e. By Primary Key or By Primary Key with History. With the Primary Key option, the table is replicated like for like to the destination. With the Primary Key with History option, the table is replicated as time series data with every change recorded with Slowly Changing Dimension type 2 history (aka point in time).
  5. In the Primary Key column, select the Primary Key for the table by checking the checkbox next to the column name.
  6. You can also mask a column by checking the checkbox. By masking a column, the selected column will not be transferred to the destination.
  7. Type Change can be specified for the columns that need a datatype conversion by selecting the 'TChange' checkbox against the column(s).
  8. Click on the ‘Apply’ button to confirm and save the details

 

 

This process of selecting tables, configuring primary keys and mask columns should be repeated for each of the tables. Once complete the next step is to…

  1. Navigate to the Schedule tab
  2. Click on the ‘Full Extract’ button to initiate the Initial Extract and Load process

Column Type Change

This feature is mostly used in SAP environments. It allows a Type Change of columns/fields from native character or numeric format to Integer, Long, Float, Date and Timestamp.

BryteFlow Ingest automatically converts data types during data replication or CDC to the destination formats.

The destination data types are:

INTEGER @I
LONG @L
FLOAT @F
DATE (including format clause e.g. yyyyMMdd) @D(format)
TIMESTAMP (including format clause e.g. yyyy-MM-dd HH:mm:ss) @T(format)

Please note: The (format) part can be almost anything based on the value in the source column. 

Partitioning

Amazon S3 And Amazon Redshift
Partitioning can dramatically improve efficiency and performance. It can be set up when replicating to S3 (data is partitioned into folders) and/or Redshift (data is partitioned into tables). The partitioning string is entered into the Partitioning folder field. The format for partitioning is as follows:

/@<column index>(<partition prefix>=<partition_format>)

Column Index

To build a partitioning folder structure the column index (starting from 1) of the column(s) to be used in the partition need to be known, in this simple table there are 3 columns…

  • customer.contact_id would be column index 1
  • customer.fullname would be column index 2
  • customer.email would be column index 3

 

Partition Prefix (optional)

Each partition can be prefixed with a named fixed string. The last character of the Partition Prefix can be set to '='; ending with '=' is useful when creating partitions on S3 as this facilitates the automated build/recovery of partitions (see below).

  • The partition prefix string should be in lower case
  • The partition prefix string should not be the same as any of the existing column names

 

An example for partitioning on the first letter of column 2 (fullname in this case) is as follows:

/@2(fullname_start=%1s)

Refer to the MSCK REPAIR TABLE command in AWS Athena documentation. A lower case partition prefix is recommended as an upper/mixed case partition prefix can result in issues when using Athena.

--Builds/recovers partitions and data associated with partitions 
MSCK REPAIR TABLE <athena_table_name>;

 

Once MSCK REPAIR TABLE <athena_table_name>; has been executed, all data will be added to the relevant partitions, and any new data will be automatically added to the existing partitions. However, if new partitions are created by BryteFlow Ingest, the MSCK REPAIR TABLE <athena_table_name>; command will have to be re-executed to make the data available for query purposes in the Athena table.
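
If desired, the repair statement can be automated outside of BryteFlow; the sketch below uses boto3's Athena client with hypothetical database, table, region and output-location values to submit MSCK REPAIR TABLE after new partitions have been created.

import boto3

athena = boto3.client("athena", region_name="ap-southeast-2")  # placeholder region

# Placeholder names - substitute your Athena database, table and query output bucket.
response = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_athena_table",
    QueryExecutionContext={"Database": "my_athena_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
)
print("Query execution id:", response["QueryExecutionId"])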

 

Format

The format is applied to the column index specified above. For example, to partition the data by year (on a date column) you'd use the format %y; to partition by the 24-hour format of time you'd use the format %H.

Partition Examples

Example 1: Year
Assuming Column Index 7 was a date field…

/@7(year=%y)

This would create partition folders such as

  • year=2016
  • year=2017
  • year=2018
  • year=2019

 

Example 2: YearMonthDay
Assuming Column Index 7 was a date field…

/@7(%y%M%d)

This would create partition folders such as

  • 20190101
  • 20190102
  • 20190103
  • 20190104

 

Example 3: yyyymmdd=YearMonthDay
Assuming Column Index 7 was a date field…

/@7(yyyymmdd=%y%M%d)

This would create partition folders such as (useful format to automate recovery/initial population of data associated with partitions when using Athena)

  • yyyymmdd=20190101
  • yyyymmdd=20190102
  • yyyymmdd=20190103
  • yyyymmdd=20190104

 

Example 4: DOB column was used to create sub partitions of yr, mth and day
Assuming DOB Column Index 4 was a date

/@4(yr=%y)/@4(mth=%M)/@4(day=%d)

 

Example 5: model_nm=model_values and then sub partitions of yearmonth=YearMonth (multiple column partitioning)
Assuming Column Index 6 was a string (containing for example model_name_a, example model_name_b and example model_name_c) and Column Index 13 was a date field…

/@6(model_nm=%s)/@13(yearmonth=%y%M)
  • model_nm=model_name_a
    • yearmonth=201801
    • yearmonth=201802
    • yearmonth=201803
  • model_nm=model_name_b
    • yearmonth=201801
    • yearmonth=201802
    • yearmonth=201803
  • model_nm=model_name_c
    • yearmonth=201801
    • yearmonth=201802
    • yearmonth=201803

 

 

Available Partition Options

Format (Datatype): Description

%y (TIMESTAMP): Four digit year, e.g. 2018
%M (TIMESTAMP): Two digit month with zero prefix, e.g. March -> 03
%d (TIMESTAMP): Two digit date with zero prefix, e.g. 01
%H (TIMESTAMP): Two digit 24 hour with zero prefix, e.g. 00
%q (TIMESTAMP): Two digit month indicating the start month of the quarter, e.g. March -> 01
%Q (TIMESTAMP): Two digit month indicating the end month of the quarter, e.g. March -> 03
%r (TIMESTAMP): Two digit month indicating the start of the half year, e.g. March -> 01
%R (TIMESTAMP): Two digit month indicating the end of the half year, e.g. March -> 06
%i (INTEGER): Value of the integer, e.g. 12345
%<n>i (INTEGER): Value of the integer prefixed by zeros to the specified width, e.g. %8i for 12345 is 00012345
%<m>.<n>i (INTEGER): Value of the integer truncated to the number of zeros specified by <n> and prefixed by zeros to the width specified by <m>, e.g. %8.2i for 12345 is 00012300
%.<n>i (INTEGER): Value of the integer truncated to the number of zeros specified by <n>, e.g. %.2i for 12345 is 12300
%s (VARCHAR): Value of the string, e.g. ABCD
%<n>s (VARCHAR): Value of the string truncated to the specified width, e.g. %2s for ABCD is AB

 

Schedule

To configure extracts to run at a specific time, perform the following steps.

  1. In the case of Oracle, Automatic is pre-selected and other options are disabled by default.
  2. For MS SQL Server you can choose the period in minutes.
  3. A daily extraction can be done at a specific time of the day by choosing hour and minutes in the drop-down.
  4. Extraction can also be scheduled on specific days of the week at a fixed time by checking the checkboxes next to the days and selecting hours and minutes in the drop-down.
  5. Click on the ‘Apply’ button to save the schedule.

API Controls for the Schedule:

Disable the Schedule by executing the below URL :

http://host:port/bryteflow/wv?cmd=rstat&func=off

Enable the Schedule by executing the below URL :

http://host:port/bryteflow/wv?cmd=rstat&func=on

 

API Controls to get the Statistics of the Ingest instance:

Executing the below URL will return the statistics of Ingest:

http://host:port/bryteflow/ingest/init?cmd=dcon&func=getstat
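
These endpoints can be called from any HTTP client. The sketch below assumes the Python 'requests' package and a placeholder host and port; the paths and query parameters are the ones documented above.

import requests

HOST = "http://localhost:8081"  # placeholder host and port

# Disable and then re-enable the schedule.
requests.get(f"{HOST}/bryteflow/wv", params={"cmd": "rstat", "func": "off"})
requests.get(f"{HOST}/bryteflow/wv", params={"cmd": "rstat", "func": "on"})

# Retrieve the statistics of the Ingest instance.
stats = requests.get(f"{HOST}/bryteflow/ingest/init", params={"cmd": "dcon", "func": "getstat"})
print(stats.text)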

 

Add a new table to existing Extracts

You can add additional table(s) if replication is up and running and the need arises to add a new table to the extraction process…

  • Click the Schedule ‘off’ (top right of screen under Schedule tab)
  • Navigate to the ‘Data’ tab
    • Select the new table(s) by navigating into database instance name, schema name and table name(s)
    • Configure the table, considering the following
      • Transfer type
      • Partitioning folder (refer to the Partitioning section of this document for details)
      • Primary key column(s)
      • Columns to be masked (optional, masked columns are excluded from replication, for example salary data)
      • Click on the ‘Apply’ button
      • Repeat process for each table that is required
  • Navigate to the ‘Schedule’ Tab
    • Click on the ‘Sync New Tables’ button

This will initiate the new table(s) for a full extract, once completed BryteFlow Ingest will automatically resume with processing deltas for the new and all the previously configured tables.

Resync data for Existing tables

If the Table transfer type is Primary Key with History, to resync all the data from source, perform the following steps

  • Click the Schedule ‘off’ (top right of screen under Schedule tab)

 

  • For Resync Data on ALL configured tables…
    • Navigate to Schedule tab
    • Click on the ‘Full Extract’ button

 

  • For Resync Data on selected tables…
    • Navigate to Data Sources tab
      • Select the table(s) by navigating into database instance name, schema name and table name(s)
      • Click on the ‘Redo Initial Extract’ button
      • Repeat process if more than one table is required
    • Navigate to the ‘Schedule’ Tab
      • Click on the ‘Sync New Tables’ button

Rollback

In the event of unexpected issues (such as intermittent source database outages or network connectivity issues) it is possible to wind back the status of BryteFlow Ingest in time and replay all of the changes. Suppose a problem occurred at, say, 16:04; you can roll back BryteFlow Ingest to a point in time before these issues started occurring, say 15:00. To perform this operation:

  1. Navigate to the Schedule tab.
  2. Click on the 'Rollback' button.
  3. The rollback screen appears; it provides a list of all of the points in time you can roll back to, in descending order (dependent upon the source database log retention policy).
  4. Select the required date (radio button) and click 'Select'.
  5. Click on 'Rollback' to initiate the rollback.
  6. The rollback will now catch up from the selected point in time to 'now', automatically replaying all of the log entries and applying them to the destination.

Configuration

The configuration tab provides access to the following sub-tabs:

  • Source Database
  • Destination Database
  • Credentials
  • License
  • Recovery
  • Remote Monitoring

Source Database

Web Port: The port on which the BryteFlow Ingest server will run.

Max Catchup Log: The number of Oracle archive logs that will be processed in one batch.

Run Every: Sets the minimum number of minutes between catchup batches.

Convert RAW to Hex: Handles RAW columns by converting them to a hex string instead of ignoring them as CHAR(1).

Destination Database

Some additional configurations for Destination Databases,

File compression:  Output Compression method, available options are as follows

  • None
  • BZIP2
  • GZIP
  • Parquet
  • ORC(snappy)
  • ORC(zlib)

Loading threads:  Number of Redshift loading threads.

Schema for all tables:  Ignore source schema and put all tables in this schema on destination

Schema for staging tables:  Schema name to be used for staging tables in destination.

Retaining staging tables:  Check to Retain staging tables in destination.

Source Start Date:  Column name for source date for type-2 SCD record.

History End Date:  Column name for history end date for type-2 SCD record.

End Date Value:  End date used for history.

Ignore database name in schema:  Check to ignore DB name as part of schema prefix for destination tables.

No. of data slices:  Number of slices to split the data file into.

Max Updates: Combine updates that exceed this value.

 

Credentials / Destination File System

Keep S3 Files: Retain files in S3 after loading into AWS Redshift.

Use SSE:  Store in S3 using SSE (server-side encryption).

S3 Proxy Host: S3 proxy host name.

S3 Proxy Host Port:  S3 proxy port.

S3 Proxy user ID:  S3 proxy user id.

S3 Proxy Password:  S3 proxy password.

 

License

To get a valid license go to Configuration tab, then to the License tab and email the “Product ID” to the Bryte support team – support@bryteflow.com

NOTE: Licensing is not applicable when sourced from the AWS Marketplace.

 

High Availability / Recovery

BryteFlow Ingest provides High Availability support; it automatically saves the current configuration and execution state to S3 and DynamoDB. As a result, an instance of BryteFlow Ingest (including its current state) can be recovered should it be catastrophically lost. Before use, this must be configured: select the Configuration tab and then the Recovery sub-tab to enter the required configuration.

BryteFlow keeps a backup of every successful job execution on S3 and DynamoDB and makes the latest version available for the user to recover from. Follow the steps below to set up recovery and to recover in case of failure.

Recovery Configuration

Pre-requisite for Enabling recovery:

  • DynamoDB – PITR Enabled for backup and recovery.
  • Enable S3 versioning

Follow below steps to configure the S3 backup location for BryteFlow Ingest:

  1. In the Instance Name field enter a business friendly name for the current instance of BryteFlow Ingest
  2. Check  ‘Enable Recovery’
  3. Enter the destination of the recovery data in S3, for example s3://your_bucket_name/your_folder_name/Your_Ingest_name
  4. S3 KMS Id: Enter KMS Id in order to encrypt logs on S3, this is optional but recommended. Once provided, the logs will be encrypted using KMS encryption.
  5. Click on the ‘Apply’ button to confirm and save the details

 

The recovery data is stored in DynamoDB (the AWS fully managed NoSQL database service). The recovery data for the named instance (in this example, Your_Ingest_Name) is stored in a DynamoDB table called BryteFlow.Ingest.

Recovery Utilisation

To recover an instance of BryteFlow Ingest, you should source a new Instance of BryteFlow Ingest from the AWS Marketplace

  1. Use the AMI sourced from the AWS Marketplace
    • https://aws.amazon.com/marketplace/pp/B01MRLEJTK
      • Requires selection of BryteFlow Ingest volume
      • Requires selection of EC2 instance type
      • For details of launching the AMI from AWS Marketplace refer to BryteFlow user guide.
    • The endpoint will be an EC2 instance running BryteFlow Ingest (as a windows service) on port 8081
    • The role assigned to the EC2 should have the required policy. Please refer to the BryteFlow guide for the same.
    • The list of instances is held in DynamoDB. The EC2 role allows access to DynamoDB and the underlying table with the saved instances; make sure to have the needed permissions.
  2. Once the EC2 is launched, Type localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console
  3. Click on the ‘Existing Instance’ button

  1. Select the existing instance you wish to restore from the list displayed (in this example there is only one, 'Your_Ingest_Name'). Once the required instance has been selected, click on the 'Select' button.

BryteFlow Ingest will collect the configuration and saved execution state of the instance selected (in this case ‘Your_Ingest_Name’) and restore accordingly.

Once restored, it is recommended to stop the EC2 instance at fault (the previous install):

  • Go to the AWS EC2 console
  • Search for the tag for ‘BryteFlow Ingest’ or the one used to launch the application.
  • Select the older EC2 instance
  • Go to ‘Actions’ -> ‘Instance State’ and click ‘Terminate’

NOTE: Recovery can also be a method of partial migration between environments (for example, DEV to PROD stacks). As the restore will clone the exact source environment and source state, further configuration will be required (for example, updating configuration options of the PROD stack EMR instance, S3 location etc). But this method could cut down on some of the workload in cases where there are hundreds of tables to be configured and you're moving to a new EC2 instance.

Recovery from Faults

BryteFlow supports high availability and auto recovery mechanisms in case of faults and failures.

  • In case of AZ faults or instance failures:
    • If the EC2 instance is terminated or affected, BryteFlow saves the last successful load as a savepoint. It resumes from the savepoint when restarted in another AZ or on another EC2 instance.
    • For EMR cluster failures, BryteFlow retries continuously until successful.
    • For EMR step failures, BryteFlow retries continuously with exponential backoff, which prevents throttling exceptions (a generic sketch of this pattern follows the list).
    • For Redshift/Snowflake connection issues, BryteFlow retries continuously until successful.
    • For source DB connection issues, BryteFlow retries continuously until successful.
    • For AWS S3 connection issues, BryteFlow retries continuously until successful.
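The sketch below illustrates the general retry-with-exponential-backoff pattern described above; it is not BryteFlow's internal implementation, just an example of the behaviour.

import time

def retry_with_backoff(operation, max_delay=300):
    # Retry the given callable indefinitely, doubling the wait between
    # attempts up to max_delay seconds (this is what avoids throttling).
    delay = 1
    while True:
        try:
            return operation()
        except Exception as err:
            print(f"Attempt failed ({err}); retrying in {delay}s")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)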

Customers looking for high availability support are recommended to configure their BryteFlow Ingest instance for high availability and recovery. Details on setting up this feature are provided in the High Availability / Recovery section.

  • In case of application faults and failures:
    • Get notified by enabling CloudWatch Logs and metrics in BryteFlow Ingest. AWS CloudWatch events can be used for alerting by writing Lambda functions for customer-specific requirements.
    • Or enable SNS in BryteFlow Ingest and subscribe to the SNS topic from the AWS console.

Details on setting up these features are provided in the Remote Monitoring section.

  • EC2 monitoring: it is recommended to enable CloudWatch monitoring for EC2 instances, or to have health checks running on the EC2 instance with proper alerts, to get notified of any issues.
  • For disk space, BryteFlow Ingest sends status to CloudWatch Logs that includes free disk space in GB. Users can write Lambda functions around this to raise alarms.

Below is a small guide to set up a Lambda function for disk checks:

Prerequisites:

  • Create an IAM role.
  • Attach a policy that allows the Lambda function to call AWS services, as below:
    • {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "Stmt1",
            "Action": [
              "logs:FilterLogEvents",
              "logs:GetLogEvents",
              "logs:PutLogEvents"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:logs:<region>:<account_ID>:log-group:<log-group-name>:log-stream:<log-stream-name>"
          },
          { 
           "Sid": "Stmt2", 
           "Action": [ 
              "sns:Publish", 
              "sns:TagResource" 
              ], 
           "Effect": "Allow", 
           "Resource": "arn:aws:sns:<region>:<account_ID>:<topic_name>" 
          }
        ]
      }
  • Create an SNS topic (refer to the AWS documentation) and use its ARN in your AWS Lambda code.

Steps to create a Lambda function:

  1. Log in to the AWS console and go to the AWS Lambda dashboard.
  2. Click on Create function.
  3. Give the function a name.
  4. Choose the runtime language Python 3.8 from the dropdown.
  5. Click on Create function.
  6. On the next screen, click on Add trigger.
  7. Select CloudWatch Logs from the dropdown.
  8. Choose your log group.
  9. Give the filter a name.
  10. Click on Add (you can add multiple log groups one by one).
  11. Add your Lambda code in the function code window (scroll down the screen).
  12. Choose the ‘Lambda Execution role’.
  13. Click on the Save button.
Sample Lambda code for the disk check is provided below:
import json
import boto3

def lambda_handler(event, context):
    # Threshold (in GB) below which an alert is sent
    freeGb = 100
    cloudwatch = boto3.client('logs')

    # Read the latest Ingest status events from the configured log stream
    response = cloudwatch.get_log_events(
        logGroupName='Oracle_LogGroup',
        logStreamName='Oracle_LogStream',
        startFromHead=False,
        limit=100
    )

    for i in response["events"]:
        msg = json.loads(i["message"])
        # The Ingest "Status" event includes free disk space in GB
        if msg["type"] == "Status" and msg["diskFreeGB"] < freeGb:
            sns = boto3.client('sns')
            sns.publish(
                TopicArn='arn:aws:sns:us-west-2:689564010160:LambdaTrigger',
                Message='Your free disk size is getting low, please contact the concerned team!'
            )
Note: the email body in the above code can be customized according to the customer’s specifications.

Time to Recover

Recovery Point Objective (RPO)

BryteFlow performs auto recovery of the instance, and as it uses highly durable services such as S3 and DynamoDB to store its data, the data has unlimited retention.

In the case of customer data, retention depends entirely on the customer’s source DB settings. If the source data is available, BryteFlow Ingest can recover and replicate from that point on.

BryteFlow aims to meet the customer expectation of near real-time latency and hence tries to recover automatically in most failure scenarios.

Recovery Time Objective (RTO)

For EC2 failures, the RTO for BryteFlow applications is minimal (in minutes), as the save-point of the Ingest application is maintained in near real time in DynamoDB, a highly durable AWS service. When the Ingest instance is back online after a restart, or after it was terminated abruptly (mostly in the case of EC2 failures), it resumes from the last successful save-point and continues replication onward without needing a full reload.

For EMR failures, the RTO depends on the time taken to launch a new cluster; when using a single EMR cluster the time varies from 10 to 30 minutes. Until the EMR cluster is up, BryteFlow retries with an exponential back-off mechanism until a successful connection is established, and replication continues from the same point without any data loss.

Please note: no full reload is needed, unlike other available solutions.

Recovery Testing

Once BryteFlow Ingest has been recovered from a failure, follow the steps below to perform basic checks before starting replication, in order to avoid further errors or issues:

  1. Start the BryteFlow Ingest service.
  2. Open the Ingest web console in the Chrome browser.
  3. Go to the ‘Connections’ tab in the left menu.
  4. Under ‘Source’, check your source DB configuration and do a ‘Test Connection’ to check connectivity between BryteFlow and the source DB.
  5. If all is good, do the same for the ‘Destination’ connections.
  6. If any hiccups are encountered in source or destination DB connectivity, troubleshoot further until a successful connection is established.
  7. Turn the Schedule to ‘ON’ and resume ongoing replication.

Remote Monitoring

BryteFlow Ingest comes pre-configured with remote monitoring capabilities. These capabilities leverage existing AWS technology such as CloudWatch Logs/Events. CloudWatch can be used (in conjunction with other assets in the AWS ecosystem) to monitor the execution of BryteFlow Ingest and, in the event of errors or failures, raise the appropriate alarms. Events from the Ingest application flow to CloudWatch Logs and to Kinesis (if configured). These events provide detailed application stats which can be used for any kind of custom monitoring.

In addition to the integration with CloudWatch and Kinesis, BryteFlow Ingest also writes its internal logs directly to S3 (BryteFlow Ingest console execution and error logs).
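For example, the logs written to S3 can be listed with boto3; this is a minimal sketch, assuming the S3 logging destination configured in the steps below (the bucket and folder names are placeholders).

import boto3

s3 = boto3.client('s3')

# List the console/execution logs Ingest has written to the configured S3 location
response = s3.list_objects_v2(Bucket='your_bucket_name', Prefix='your_folder_name/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['LastModified'])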

Prerequisites for enabling Remote Monitoring in BryteFlow Ingest are as below:

Please note the services below are optional; customers can choose to set up any, all, or none.

 

 

To configure remote monitoring, perform the following steps:

  1. Enter an Instance Name, a business-friendly name for the current instance of BryteFlow Ingest.
  2. Check Enable S3 Logging if you want to record data to S3 (console/execution logs).
  3. Enter the destination of the logging data in S3, for example s3://your_bucket_name/your_folder_name.
  4. Enter the name of the CloudWatch Log Group (this needs to be created first in the AWS console; a sketch for creating the group and stream follows this list).
  5. Enter the name of the CloudWatch Log Stream under the aforementioned Log Group (again, this needs to be created first in the AWS console).
  6. Check Enable CloudWatch Metrics if required.
  7. Check Enable SNS Notifications.
    1. Enter the Topic ARN or SNS Topic Name in the SNS Topic input box.
  8. If the events need to flow to Kinesis, provide the Kinesis stream name in the ‘Kinesis Stream’ field.
  9. Click Apply to save the changes.
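If you prefer to script the prerequisites in steps 4 and 5, the following is a minimal boto3 sketch; the Log Group and Log Stream names are placeholders.

import boto3

logs = boto3.client('logs')

# Create the CloudWatch Log Group and Log Stream that Ingest will write to
logs.create_log_group(logGroupName='BryteFlowIngest_LogGroup')
logs.create_log_stream(
    logGroupName='BryteFlowIngest_LogGroup',
    logStreamName='BryteFlowIngest_LogStream'
)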

The events that BryteFlow Ingest pushes to the AWS CloudWatch Logs and metrics console are as follows; please refer to Appendix: Bryte Events for AWS CloudWatch Logs and SNS for a more detailed breakdown.

Bryte Events – Description
LogfileProcessed – Archive log file processed (Oracle only)
TableExtracted – Source table extract complete, MS SQL Server and Oracle (initial extracts only)
ExtractCompleted – Source extraction batch is complete
TableLoaded – Destination table load is complete
LoadCompleted – All destination table loads in a batch are complete
HaltError – Unrecoverable error occurred and turned the Scheduler to OFF
RetryError – Error occurred but will retry
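As one way to alert on these events, a CloudWatch metric filter can count HaltError messages so an alarm can be attached to the resulting metric. This is a hedged sketch only; the log group, filter, and namespace names are placeholders.

import boto3

logs = boto3.client('logs')

# Count HaltError events in the Ingest log group; attach a CloudWatch alarm
# to the resulting metric to get notified when the Scheduler is turned OFF.
logs.put_metric_filter(
    logGroupName='BryteFlowIngest_LogGroup',
    filterName='BryteFlowHaltError',
    filterPattern='{ $.type = "HaltError" }',
    metricTransformations=[{
        'metricName': 'HaltErrorCount',
        'metricNamespace': 'BryteFlowIngest',
        'metricValue': '1'
    }]
)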

 

Log

You can monitor the progress of your extracts and loads by navigating to the “Log” tab. The log shows the progress and current activity of the Ingest instance. Filters can be set to view specific logs, such as errors.

BryteFlow Ingest stores its log files under your install folder, specifically under the \log folder.
The path to the log files is <install folder of Ingest>\log\sirus*.log, for example

c:\Bryte\Bryte_Ingest_37\log\sirus-2019-01.log

The error files are also stored under the \log folder.
The path to the error files is <install folder of Ingest>\log\error*.log, for example

c:\Bryte\Bryte_Ingest_37\log\error-2019-01.log
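To quickly locate the most recent log file on the server, a small script such as the sketch below can be used; it assumes the default install path shown above.

import glob
import os

log_dir = r'c:\Bryte\Bryte_Ingest_37\log'

# Find the most recently modified Ingest log file
log_files = glob.glob(os.path.join(log_dir, 'sirus*.log'))
if log_files:
    print('Most recent Ingest log:', max(log_files, key=os.path.getmtime))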

These logs can also be reviewed/stored in S3; please refer to the Remote Monitoring section for details.

Optimize usage of AWS resources / Save Cost

EMR Tagging

BryteFlow Ingest supports an EMR tagging feature which can help you save significant cost on EMR clusters. It allows customers to control EMR cost by terminating the cluster when not in use, without interrupting the Ingest configuration and schedule.

You can add the default tag ‘BryteflowIngest’ when creating a new Amazon EMR cluster for Ingest, or you can add, edit, or remove tags on a running Amazon EMR cluster, and then use the tag name and value in the EMR Configuration section of Ingest as in the image below. Ingest also handles EMR changeover: if the tag name on an existing cluster is changed and a new cluster with the correct tag is created, any existing jobs on the old cluster will complete and new jobs are started on the new cluster.
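Tags can also be added to a running cluster with boto3, as in the sketch below; the cluster ID and tag value are placeholders.

import boto3

emr = boto3.client('emr')

# Add the tag Ingest looks for to an existing EMR cluster
emr.add_tags(
    ResourceId='j-XXXXXXXXXXXXX',
    Tags=[{'Key': 'BryteflowIngest', 'Value': 'Prod'}]
)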

 

Tagging AWS Resources

AWS allows customers to assign metadata to their AWS resources in the form of tags. It is recommended that you tag all the AWS resources created for and by BryteFlow, for managing and organizing resources, access control, cost tracking, and automation.

It is recommended to use tag names that are specific to the instance being created. For example, for a BryteFlow instance replicating a source DB that is a production database server for Billing and Finances, the tag name should reflect the database it is dedicated to, such as ‘BryteFlowIngest_BFS_EC2_Prod’; similarly, for a UAT environment it could be ‘BryteFlowIngest_BFS_EC2_UAT’. This way, customers can easily differentiate between the various AWS resources being used within their environment. Use similar tag names for each service.

BryteFlow recommends tagging the AWS services listed below, used by BryteFlow, with a unique, identifiable tag name (a tagging sketch follows the list).

  • AWS EC2
  • AWS EMR
  • AWS Redshift instances
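The sketch below applies the naming convention above to an EC2 instance using boto3; the instance ID is a placeholder, and EMR and Redshift offer analogous tagging APIs.

import boto3

ec2 = boto3.client('ec2')

# Tag the Ingest EC2 instance with an environment-specific name
ec2.create_tags(
    Resources=['i-0123456789abcdef0'],
    Tags=[{'Key': 'Name', 'Value': 'BryteFlowIngest_BFS_EC2_Prod'}]
)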

For a detailed guide on tagging resources in AWS, refer to the AWS documentation links provided:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags.html

https://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-tagging.html

Upgrade software versions from AWS Marketplace

Users already using the BryteFlow AMI Standard Edition can easily upgrade to the latest version of the software directly from AWS Marketplace by following a few easy steps.

Steps to perform in your current install:

  • As you are planning to upgrade, make sure your current setup is backed up.
  • To save your current instance setup and stats, go to ‘Configuration’ -> ‘Recovery’.
  • In the Instance Name field, enter a business-friendly name for the current instance of BryteFlow Ingest.
  • Check Enable Recovery.
  • Enter the destination of the recovery data in S3, for example s3://your_bucket_name/your_folder_name/Your_Ingest_name.
  • Click on the ‘Apply’ button to confirm and save the details.
  • Once recovery is set up, turn the Schedule to ‘OFF’ on the current version and let it come to a complete pause.
  • Go to the product URL on AWS Marketplace: https://aws.amazon.com/marketplace/pp/B01MRLEJTK
  • In the product configuration settings, choose the latest available version from the ‘software version’ dropdown.
  • Click ‘Continue to Launch’ for your new instance.
  • Choose Action ‘Launch from Website’.
  • Select your EC2 instance type based on your data volume; recommendations are available on the product detail page.
  • Choose your VPC from the dropdown or go with the default.
  • Select the ‘Private Subnet’ under ‘Subnet Settings’. If none exists, it is recommended to create one; please follow the detailed AWS User Guide for creating a subnet.
  • Update the ‘Security Group Settings’ or create one based on the BryteFlow recommended steps below:
    • Assign a name to the security group, e.g. BryteFlowIngest.
    • Enter a description of your choice.
    • Add inbound rule(s) to RDP to the EC2 instance from your custom IP address.
    • Add outbound rule(s) to allow the EC2 instance to access the source DB. DB ports vary based on the source database; add rules to allow the instance access to the specific source database ports.
    • For more details, refer to the BryteFlow recommendation on Network ACLs for your VPC in the section ‘Recommended Network ACL Rules for EC2’.
  • Provide the Key Pair Settings by choosing an EC2 key pair of your own, or create a key pair in EC2.
  • Click ‘Launch’ to launch the EC2 instance.
  • The endpoint will be an EC2 instance running BryteFlow Ingest (as a Windows service) on port 8081.

 

Steps to perform in your new install:

  • Connect to the new instance using ‘Remote Desktop Connection’ to the EC2 instance launched via the AMI.
  • Once connected to the EC2 instance, launch BryteFlow Ingest from the Google Chrome browser using the bookmark ‘BryteFlow Ingest’,
  • or type localhost:8081 into the Chrome browser to open the BryteFlow Ingest web console.
  • This will bring up a page requesting either a ‘New Instance’ or an ‘Existing Instance’.
    • Click on the ‘Existing Instance’ button, as we need to resume BryteFlow Ingest from the last saved state.
    • Select the existing instance you wish to restore from the list displayed; in this example there is only one (‘Your_Ingest_Name’). Once the required instance has been selected, click the ‘Select’ button.

  • BryteFlow Ingest will collect the configuration and saved execution state of the selected instance (in this case ‘Your_Ingest_Name’) and restore accordingly.
  • Go to the ‘Connections’ tab and test the ‘Source’, ‘Destination’ and ‘File System’ connections prior to turning the Schedule on.
  • In case of any connection issues, check the firewall settings of the EC2 instance and the source systems.
  • Once all connections are ‘Tested OK’, go to the ‘Schedule’ tab and turn the schedule to ‘ON’.
  • This completes the upgrade and resumes ingestion as per the specified schedule.

Once upgraded, it is recommended to terminate the EC2 instance of the previous install:

  • Go to the AWS EC2 console.
  • Search for the ‘BryteFlow Ingest’ tag, or the tag used to launch the application.
  • Select the older EC2 instance.
  • Go to ‘Actions’ -> ‘Instance State’ and click ‘Terminate’.

BryteFlow: Licencing Model

BryteFlow’s licensing model is based on the volume of source data being replicated to the destination.

Volume-based licensing is classified into the groups below:

  • 100GB
  • 300GB
  • 1TB
  • > 1TB (contact Bryte Support)

BryteFlow products are available from the AWS Marketplace for data volumes up to 1TB. For source data volumes greater than 1TB, it is recommended to contact BryteFlow support (email: support@bryteflow.com) for detailed information.

BryteFlow Ingest : Pricing

BryteFlow products are available for use via AWS Marketplace in two different flavors:

  1. BryteFlow Standard Edition-Data Integration for S3, Redshift, Snowflake
  2. BryteFlow Enterprise Edition-Data Integration S3, Redshift, Snowflake

Please note that infrastructure costs and the use of any other AWS services are additional to the BryteFlow software cost. To estimate the cost of your AWS services, please use the AWS Pricing Calculator.

BryteFlow Support Information

Each of our products is backed by our responsive support team. Please allow 24 hours for us to get back to you. To get in touch with our support team, send an email to support@bryteflow.com.

BryteFlow provides Level 3 support to all its Customers.

Tier 1 – Business hours
Tier 2 – 24×7 Support
Support Language – English (US & UK)

Maintenance And Support Level – Description

Business Hours Support – Support for suspected bugs, errors or material differences between the use of Software and the specifications of Software outlined in the Documentation (Incidents). The scope of the Maintenance and Support Service is outlined with additional terms and conditions at Appendix A.

Premium Support –
  • 24×7 Support
  • Email support
  • Access to customer portal
  • Software updates
  • Escalation management for critical issues
  • Severity 1 issues – 1 hour
  • Severity 2 issues – 2 hours
  • Severity 3 issues – 4 hours

 

 

Appendix: Understanding Extraction Process

Extraction Process

Understanding Extraction.

Extraction has two parts to it.

  1. Initial Extract.
  2. Delta Extract.

Initial Extract.

An initial extract is done the first time a database is connected to the BryteFlow Ingest software. In this extract, the entire table is replicated from the source database to the destination (AWS S3 or AWS Redshift).

A typical extraction goes through the following stages. The example below shows an extraction with MS SQL Server as the source and an Amazon S3 bucket as the destination.

Extracting 1
Full Extract database_name:table_name
Info(ME188): Stage pre-bcp
Info(ME190): Stage post-bcp
Info(ME260): Stage post-process
Extracted 1
Full Extract database_name:table_name complete (4 records)
Load file 1
Loading table emr_database:dbo.names with 4 records(220 bytes)
Transferring null to S3
Transferred null 10,890 bytes in 8s to S3
Transferring database_name_table_name to S3

Delta Extract.

After the initial extract, when the database is replicated to the destination, we do a delta extract. In delta extracts, only the changes on the source database are extracted and merged with the destination.

After the initial extraction is done, all further extracts are Delta Extracts (changes since the last extract).

A typical delta extract log is shown below.

Extracting 2
Delta Extract database_name:table_name
Info(ME188): Stage pre-bcp
Info(ME190): Stage post-bcp
Info(ME260): Stage post-process
Delta Extract database_name complete (10 records)
Extracted 2
Load file 2
Loaded file 2

First Extract

Extracting a database for the first time.

Keep all defaults. Click on Full Extract.

The first extract always has to be a Full Extract. This gets the entire table across, and then the deltas are populated periodically at the desired frequency.

Schedule Extract

 

To configure extracts to run at a specific time perform the following steps.

  1. In the case of Oracle, ‘Automatic’ is preselected and other options are disabled by default.
  2. For MS SQL Server you can choose the period in minutes.
  3. A daily extraction can be done at a specific time of the day by choosing hour and minutes in the drop-down.
  4. Extraction can also be scheduled on specific days of the week at a fixed time by checking the checkboxes next to the days and selecting hours and minutes in the drop-down.
  5. Click Apply to save the schedule.

Add a new table to existing Extracts

After databases have been selected for extraction and they are replicating, a new table can be added to the extraction process by following these steps:

  • Click the Schedule ‘off’ (top right of screen under Schedule tab)
  • Navigate to Data tab
    • Select the new table(s) by navigating into database instance name, schema name and table name(s)
    • Configure the table, considering the following
      • Select transfer type
      • Select partitioning folder (refer to Partitioning section for details)
      • Select primary key column(s) where applicable
      • Select columns to be masked (optional, these are excluded from extraction, for example salary data)
      • Click on the ‘Apply’ button
      • Click on the ‘Full Extract’ button
      • Repeat process if more than one table is required
  • Navigate to the Schedule Tab
    • Click on ‘Sync New Tables’ button

This will include the new table(s) for a full extract and also resume with deltas for all the previously configured tables and the newly added table(s).

Resync data for Existing tables

If the table transfer type is Primary Key with History, to resync all the data from the source, perform the following steps:

  • Click the Schedule ‘off’ (top right of screen under Schedule tab)
  • For Resync Data on ALL configured tables…
    • Navigate to Schedule tab
    • Click on the ‘Full Extract’ button
  • For Resync Data on selected tables..
    • Navigate to Data Sources tab
      • Select the table(s) by navigating into database instance name, schema name and table name(s)
      • Click on the ‘Full Extract’ button
      • Repeat process if more than one table is required
    • Navigate to the Schedule Tab
      • Click on ‘Sync New Tables’ button

 

Appendix: Bryte Events for AWS CloudWatch Logs and SNS

BryteFlow Ingest supports connections to AWS CloudWatch Logs, CloudWatch Metrics and SNS. These can be used to monitor the operation of BryteFlow Ingest and integrate with other assets leveraging the AWS infrastructure.

AWS CloudWatch Logs can be used to send logs of events, such as load completion or failure, from BryteFlow Ingest. CloudWatch Logs can be used to monitor error conditions and raise alarms.

Below is the list of events that BryteFlow Ingest pushes to the AWS CloudWatch Logs console and to AWS SNS:

Bryte Events – Description
LogfileProcessed – Archive log file processed (Oracle only)
TableExtracted – Source table extract complete, MS SQL Server and Oracle (initial extracts only)
ExtractCompleted – Source extraction batch is complete
TableLoaded – Destination table load is complete
LoadCompleted – All destination table loads in a batch are complete
HaltError – Unrecoverable error occurred and turned the Scheduler to OFF
RetryError – Error occurred but will retry
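To receive these events by email, an address can be subscribed to the SNS topic configured in Ingest; the sketch below uses boto3, and the topic ARN and address are placeholders.

import boto3

sns = boto3.client('sns')

# Subscribe an email address to the Ingest SNS topic
sns.subscribe(
    TopicArn='arn:aws:sns:<region>:<account_ID>:<topic_name>',
    Protocol='email',
    Endpoint='ops-team@example.com'
)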

Below are the details for each of the Bryte Events:

Event : LogfileProcessed

Attribute   Is Metric (Y/N)?   Description

type N “LogfileProcessed”
generated N Timestamp of message
source N Instance name
sourceType N “CDC”
fileSeq N File sequence
file N File name
dictLoadMS Y Time taken to load dictionary in ms
CurrentDBDate N Current database date
CurrentServerDate N Current Bryte server date
parseMS Y Time taken to parse file in ms
parseComplete N Timestamp when parsing is complete
sourceDate N Source date
Event : TableExtracted

Attribute   Is Metric (Y/N)?   Description

type N “TableExtracted”
subType N Table name
generated N Timestamp of message
source N Instance name
sourceType N “CDC”
tabName N Table name
success N true/false
message N Status message
sourceTS N Source date time
sourceInserts Y No. of Inserts in source
sourceUpdates Y No. of Updates in source
sourceDeletes Y No. of Deletes in source
Event : ExtractCompleted

Attribute   Is Metric (Y/N)?   Description

type N “ExtractCompleted”
generated N Timestamp of message
source N Instance name
sourceType N “CDC”
jobType N “EXTRACT”
jobSubType N Extract type
success N Y/N
message N Status message
runId N Run Id
sourceDate N Source date
dbDate N Current database date
fromSeq N Start file sequence
toSeq N End file sequence
extractId N Run id for extract
tableErrors Y Count of table errors
tableTotals Y Count of total tables
Event : TableLoaded

Attribute   Is Metric (Y/N)?   Description

type N “TableLoaded”
subType N Table name
generated N Timestamp of message
source N Instance name
sourceType N “CDC”
tabName N Table name
success N true/false
message N Status message
sourceTS N Source date time
sourceInserts Y No. of Inserts in source
sourceUpdates Y No. of Updates in source
sourceDeletes Y No. of Deletes in source
destInserts Y No. of Inserts in destination
destUpdates Y No. of Updates in destination
destDeletes Y No. of Deletes in destination
Event : LoadCompleted

Attribute   Is Metric (Y/N)?   Description

type N “LoadCompleted”
generated N Timestamp of message
source N Instance name
sourceType N “CDC”
jobType N “LOAD”
jobSubType N Sub type of the “LOAD”
success N Y/N
message N Status message
runId N Run Id
sourceDate N Source date
dbDate N Current database date
fromSeq N Start file sequence
toSeq N End file sequence
extractId N Run id for extract
tableErrors Y Count of table errors
tableTotals Y Count of total tables
Event : HaltError

Attribute   Is Metric (Y/N)?   Description

type N “HaltError”
generated N Timestamp of message
source N Instance name
sourceType N “CDC”
message N Error message
errorId N Short identifier
Event : RetryError

Attribute   Is Metric (Y/N)?   Description

type N “RetryError”
generated N Timestamp of message
source N Instance name
sourceType N “CDC”
message N Error message
errorId N Short identifier

Appendix: Release Notes

Release details (by date descending, latest version first)

Bryteflow Ingest 3.11.4

Release Notes BryteFlow Ingest – v3.11.4

Released February 2022

New Features

  1.  PostgreSQL DB as a new source connector.
  2. Oracle RAC as a new source connector.
  3. Oracle Pluggable DB as a new source connector.
  4. Access to views on Ingest UI for SAPHANA Full Extracts and Timestamps.
  5. Support for new destination – Load S3 Deltas.
  6. Support for new destination – Google BigQuery from all SAP sources.
  7. Support for new destination – ADLS2.
  8. Enhancements for Oracle Continuous Logminer sources
  9. Active directory (AD) support for MsSQL server destination.
  10. Enhancements in extracting BLOB/CLOB data from Oracle.
  11. Enhancements related to Log4j issue.

Known Issues

1. Sync Struct is not supported for S3/EMR destination with CSV (Bzip2, Gzip, None) output format.
Only supported for Parquet (Snappy) and ORC (Snappy).
2. Athena table creation is supported for Parquet (Snappy) compression only from S3/EMR.
3. Source database type JDBC Full Extract does not work for all databases.

BryteFlow Ingest 3.11

Release Notes BryteFlow Ingest – v3.11 

Released April 2021

New Features

1. NEW UI on DATA page with tree and list view with filters for table.
2. Support for Oracle RAC source.
3. Timestamp Changes With Daylight Savings.
4. Fix for the tables having special characters in table-name like slash and underscore.
5. Fixes related to Type-Change tables.
6. Fixes for Postgres source.

Known Issues

1. Sync Struct is not supported for S3/EMR destination with CSV (Bzip2, Gzip, None) output format.
Only supported for Parquet (Snappy) and ORC (Snappy).
2. Athena table creation is supported for Parquet (Snappy) compression only from S3/EMR.
3. Source database type JDBC Full Extract does not work for all databases.

BryteFlow Ingest 3.10.1

Release Notes BryteFlow Ingest – v3.10.1

Released October 2020

New Features

  • Auto-selection of primary key when you go to the ‘Data’ section and choose the table. 
  • Support for SAP HANA as a source.
  • GUI performance improvement
  • EMR cluster changeover logic.
  • MsSQL – Partition for NULL values of DATE column in default value 1970-01-01 on S3.

Bug Fixes

  • Timestamp Changes with Daylight savings
  • Fix for the tables having special characters in table-name like slash and underscore.
  • Fixes related to Type-Change tables 

 

Limitations:

  • Sync Struct feature is not supported for S3/EMR destination with CSV (Bzip2, Gzip, None) output format.
    Only supported for Parquet (Snappy) and ORC (Snappy).
  • Athena table creation is supported for Parquet (Snappy) compression only from S3/EMR.

BryteFlow Ingest 3.10

Release Notes BryteFlow Ingest – v3.10 

Released June 2020

New Features

  • Support for Direct Snowflake Multiload option in Destination DB
  • Changes in fields of Destination Database connection page

BryteFlow Ingest 3.10

Release Notes BryteFlow Ingest – v3.10

Released May 2020

New Features

  • Support for Oracle Redo Logs as a source
  • Support for Salesforce as a source
  • Support for Snowflake as a destination using Amazon S3
  • Support for Snowflake as a destination without using Amazon S3 (Direct Load)
  • IAM access available as an option instead of using AWS Credentials
  • Automatic creation of tables in Athena for S3/EMR destination
  • JDBC options can now be entered into the Source and Destination Database
    details. These extend the JDBC URL used to access the databases.
  • Cloudwatch Logs are sent out if connections fail for an hour at a time
  • Snowflake fields can now be entered separately as
    Account/Warehouse/Database
  • The Log page now uses filters to set level of log messages displayed
  • Supports conversion of UTF-8 characters to UTF-16 for MS SQL Server as a
    destination
  • Support for Kinesis as a destination for event messages in addition to those
    currently sent to Cloudwatch Logs
  • Support for Type Change of fields from native character (or numeric )format to
    Integer, Long, Float, Date, Timestamp
  • Rollback files now shown with the latest file on top
  • Cloudwatch Log message sent if EMR process runs too long. A message is sent
    every hour.
  • SNS Topic Name can be entered instead of full ARN
  • Backup/recovery of files uses instance name in addition to the provided S3
    folder name. This should allow the same folder to be specified across all
    instances.

Bug Fixes

  • Scheduler is turned off after 3 unsuccessful attempts at loading the data. This does not include connection errors.
  • Scheduler is turned off when a structure mismatch is detected
  • Tables with special characters like / and $ are now handled correctly for an Oracle source

Known Issues

  • A full extract will not close deleted records with an S3/EMR destination
  • Change in table structure may not be detected on S3/EMR destination when using CSV (any compression). The workaround is to use the Parquet format.
  • Connection to AWS Aurora may show a verbose warning even if connection is successful

BryteFlow Ingest 3.9.3

Release Notes BryteFlow Ingest – v3.9.3

Released March 2020

New Features

  • MySQL is now supported as a source
  • CloudWatch messages are sent with each log file processed in MySQL
  • Ingest differentiates between null and zero length strings in MySQL
  • Support for fail-over database for Oracle

Bug Fixes

  • Some errors around Delta extracts for tables with Primary Key Only replication have been fixed
  • Warnings on unknown types are now handled and do not appear
  • Issues around table names with special characters like $ or / have now been fixed
  • Communication link errors with Aurora database are now handled
  • A number of MS SQL data types (BIGINT, NUMERIC, BINARY, SMALLDATETIME) are now handled correctly on S3/EMR destination
  • Issue around @ character in Oracle passwords is now fixed
  • Errors around handling timestamps in the parquet-snappy format are now fixed
  • Ingest no longer needs bucket level access when using S3/EMR+Snowflake as a destination – folder level access is sufficient
  • Strings are correctly handled now in the parquet format instead of being shown as binary data.

BryteFlow Ingest 3.9

Release Notes BryteFlow Ingest – v3.9 

Released December 2019

New Features

  • New UI
  • Redo Initial Extract and Skip Initial Extract functionality changed in the UI as
    well as in implementation
  • CloudWatch Log Status message now includes free disk space
  • Performance improvement in accessing database across the internet
  • Support in the UI for offset time in periodic extract
  • Support in the UI for offset time in Oracle Log catchup time

Bug Fixes

  • Fixes to IAM Role used for Parquet and ORC format for S3 + Redshift destination
  • Improved error message for expired Oracle archive logs

Known Issues

  • Structure Sync operations in Snowflake do not work for a small number of cases
  • Destination comparison may not work when file format is parquet.snappy

BryteFlow Ingest 3.8

Release Notes BryteFlow Ingest – v3.8 

Released November 2019

New Features

  •  Optimization for access to on-premises Oracle from Bryteflow server on AWS
  •  Support for Parquet and ORC formats for S3/EMR + Redshift
  • Process for loading large tables using Ingest and Ingest XL
  • User defined namespace for Cloudwatch Metrics
  • Exponential backoff on throttling of EMR access
  • Support for Oracle log mining on a separate Oracle instance
  • Cutoff date now has a cutoff offset as well
  • Support for processing partitions in a specified order
  • Support for Snowflake destination

Bug Fixes

  • Support for UTF-8 characters in data
  • Strings now appear correctly in Athena for files in Parquet format

Known Issues

  • Some Sync operations may show structure differences in Snowflake where none exist

BryteFlow Ingest 3.7.3

Release Notes BryteFlow Ingest – v3.7.3

Released April 2019

New Features 

  • Notifications: A notification icon appears on the top bar. The number of issues appears as a bubble. The bubble is red if there is at least one error, and orange for warnings. Hovering gives the count of errors and warnings in a tooltip. Clicking on the icon lists the issues in a dropdown list.
  • Help: Clicking on the help icon takes you to the online documentation.
  • Drivers: The supported source and destination drivers have been streamlined.
  • EMR tags: EMR instance can be specified by Cluster ID or a tag value for the tag BryteFlowIngest or a tag and value expressed as “tag=value”.
  • Port: Changing the web port is allowed for non-AMI installations.

Bug Fixes

  • An issue with an error on no initial extract even when skip initial extract is done for a table is now fixed.
  • Some attribute values in Cloudwatch Logs which were previously blank have now been fixed.
  • An issue where all pending jobs were canceled on a failure in the current job is now fixed.
  • A redundant field when “S3 Files using EMR” is selected as a destination has been removed.
  • The Apply button is disabled if no changes have been made in Source/Destination/File screens. This gets around the problem of the connection being flagged with a warning on pressing apply even if no fields have been changed.
  • Initial extract in Oracle now sets the effective date to the database date instead of the server date.

Known Issues

  •  Non-AMI EC2 may show some warning messages on startup.

BryteFlow Ingest 3.7

Release Notes BryteFlow Ingest – v3.7

Released:  January 2019

  • BryteFlow Ingest 3.7 available on AWS Marketplace as an AMI
    • Pay as you go on AWS Marketplace
    • Hourly/Annual billing options on AWS Marketplace
    • No licence keys on AWS Marketplace
    • 5 day trial on AMI (new customers only)
  • Volume based licensing
    • 100GB
    • 300GB
    • 1TB
    • > 1TB (contact Bryte Support)
  • High Availability
    • Automatic backup of current state
    • Automatic cut-over and recovery following EC2 failure
    • IAM support
  • Rollback to previous saved point in time
    • Dependent upon source db logs
  • Partitioning
    • Built in to Interface
    • Partition names configurable, wide range of formats
    • AWS Athena friendly partition names
  • New S3 compression options
    • Parquet
    • ORC(Snappy)
    •  ORC(Zlib)
  • Remote Monitoring (integrated with AWS Services)
    • Cloudwatch
    • Logging
    • Metrics
    • SNS