Oracle to Amazon S3 – how to refresh continuously

Organisations that have been around long enough would have volumes of transaction data collected and stored in Oracle databases over the years.  And data is the currency of the digital economy. But all too often organisations have found it difficult to monetize the data because there is no easy, accurate and economical means in unifying the Oracle data silos dotting the various lines of businesses.

As advances are made around AI/ML and analytics at breakneck pace, and are easily consumable via the AWS cloud platform, organisations are looking to gain a competitive edge by applying new technologies to their data sets and unlocking new insights about their business and customers. To make this a reality, organisations are bringing their data closer to the new technologies by first centralizing their data in a cloud data lake like Amazon Simple Storage Service or Amazon S3, which can accommodate large volumes and a variety of data. What’s more, S3 has a very low cost  business model and is a core AWS component service in modern data architecture.

Bringing data from legacy Oracle silos to Amazon S3 is no mean feat, especially if you don’t want users of the applications underpinned by the Oracle databases to suffer  performance issues. That’s why you need a log-based Change Data Capture (CDC) data integration tool from Oracle to Amazon S3. One that replicates data to S3 with minimal impact on the source Oracle system.

Given S3 is infinite object storage, every change in Oracle data becomes effectively a new file on S3. Every other tool out there either gets you a full load from the source, or simply lands all the delta changes on S3 (and there could be thousands of them) and the onus is back on the organisation to manually merge or stitch these files into a replica of an Oracle source table. Manually coding the merges is cumbersome, prone to errors, difficult to diagnose, and accompanied by endless maintenance overheads. And if you haven’t yet added SCD Type-2 history to the mix, as, if you need that, projects could easily get derailed.

Then there’s the challenge of ingesting an Oracle table that is larger than 20GB, what would be the optimal approach? To do a full extract every time, puts a lot of load on the source and on the destination. Further, the data can be made available only once a day. That’s where a CDC tool with built-in large table partitioning comes into its own. A large table can be partitioned by the timeframe of year i.e. 2019, 2018, 2017 etc. So, if the end-users are required to analyse 2019’s sales results they only have to use a smaller 2019 data set, which means the queries are faster and much more cost-effective, if they query using AWS Athena service. It would be a huge bonus if the CDC software creates partitions that are Athena-friendly.

Most businesses are now operating in an on-demand world, which means they are constantly making data-driven decisions. Overnight analytics and decision making will not keep pace with customers’ demands or help keep up with your competitors. That’s why having a robust, high performing and near real-time data integration tool is critical to the survival of your business.

BryteFlow is built for bridging the legacy data silos to the new digital economy.

BryteFlow Ingest is meant for data ingestion and can sync terabytes of data in near real-time with minimal impact on Oracle source systems as it does a CDC to S3. Table replicas are made available like-for-like on Amazon S3, so when you query a table on S3, it is as though you have queried the source Oracle database. Or you could configure SCD Type2 history to give you every change with a start and end date for historical or point-in-time analytics.

If you operate in a regulated industry, BryteFlow TruData will continuously do a checksum comparison of that table with your Oracle data for data completeness.