White Paper

How Change Data Capture Speeds Decision-Making and Lowers Operational Costs

Support Streaming Analytics and AI/ML Use Cases in Real Time
Contents

The Evolution of Modern Data Architectures – Key Trends and Drivers
The Need for Real-Time Change Data Capture
Change Data Capture Use Cases
Benefits of Change Data Capture
Methods of Change Data Capture
Change Data Capture with Informatica Intelligent Data Management Cloud™
Conclusion
Next Steps
About Informatica
The Evolution of Modern Data Architectures – Key Trends and Drivers
Data is at the core of how modern enterprises run their businesses and is a crucial enabler in driving
digital transformation. Digital transformation has never been more critical than it is today, as the pace
of disruption is only accelerating. According to a recent study from Innosight,¹ the 30- to 35-year average tenure of S&P 500 companies in the late 1970s is forecast to shrink to 15 to 20 years this decade.
Organizations are trying to become data-centric, but the traditional approaches don’t scale and don’t
provide insights that are required to drive innovation. Over time, enterprises accumulate terabytes and
petabytes of data stored in on-premises databases, ERP and CRM systems. They collect the data, run
ETL jobs and ingest data into a data warehouse such as Teradata, SQL Server or an Oracle warehouse.
And when the data increases, they add more data warehouse appliances. The challenge with this approach
is that it creates data silos. As a result, organizations are unable to create end-to-end, 360-degree views of
their customers, markets and products.
With a modern data architecture, organizations can take advantage of exponential data growth and gain
the benets of end-to-end analytics insights. Migrating from a legacy on-premises data warehouse to a
the benefits of end-to-end analytics insights. Migrating from a legacy on-premises data warehouse to a cloud data warehouse and cloud data lake provides benefits such as performance, availability, cost,
manageability and flexibility without compromising security.
Data architecture is going through three fundamental shifts that are disrupting traditional methods of
handling, analyzing and structuring data.
1. Data Warehouse to Data Lake/Lakehouse
A data lake is a strong complement to a data warehouse. And many enterprises are now adopting a new
combined architecture, the “lakehouse.” A lakehouse merges data warehouses and data lakes in one data
platform. A lakehouse brings the best of both worlds together by combining technologies for business
analytics and decision-making with those for exploratory analytics and data science.
The data lake provides cost-effective processing and storage, which is distributable, highly available and
can store data without applying a schema to it. Instead, the schema can be applied later to read the data
for analytics consumption. You can store many different data types: structured, unstructured or semi-
structured. Data lakes are critical for organizations that want to be innovative and intend to address
articial intelligence (AI) and machine learning (ML) use cases.
¹ Innosight, 2021 Corporate Longevity Report, 2021
2. Batch Processing to Stream Processing
While there will always be a place for batch processing, there is a notable increase in the demand for
streaming content; the need for capture and analysis of real-time data increases as the value of time-sensitive data increases. With the adoption of Kappa architecture and other streaming-first architectural patterns, stream processing has become mainstream.² Real-time processing of customer data can create new revenue opportunities, and tracking and analyzing IoT sensor data can improve operational efficiency.
Batch processing can also be combined with stream processing to enrich the content even more.
Whether it is for strategic decisions or a moment-based decision, stream processing enables organizations
to make accurate and faster decisions based on fresh data. For example, stream processing enables you to
identify cross-sell opportunities when a customer walks into a store. Real-time stream processing helps to capture the customer's location and integrate location data in real time with historical insights from batch data to provide the correct in-moment cross-sell opportunity.
3. On-Premises to Cloud
Cloud has become mainstream as security concerns about the technology have abated in most industries.
Resource elasticity and cost advantages have made cloud a significant component of multi-datacenter architectures. According to a 2022 Flexera report, enterprises are running 49% of workloads and storing 46% of data in a public cloud.³
These technology trends enable enterprises to realize benefits such as agility, flexibility and efficiency, as
well as innovation. Businesses can now get better insights from their data and offer the right opportunities
to the right individuals with a seamless experience. These fundamental shifts in data architecture are
opening up new use cases that were not possible with traditional data management approaches. This is
especially true of real-time streaming analytics use cases in the cloud. The Venn diagram below shows the
overlapping use cases for data lakes, streaming and cloud.
² Informatica, "Kappa Architecture – Easy Adoption with Informatica Streaming Data Management Solution"
³ Flexera, State of the Cloud Report, 2022
Figure 1. A Venn diagram showing the overlapping use cases for data lakes, streaming and cloud: cloud data ingestion and replication, streaming analytics, and AI/ML. Sources feeding these include files, relational systems, legacy systems and data warehouses.
The Need for Real-Time Change Data Capture
Today, just about every industry — healthcare, retail, telco, banking, etc. — is being transformed by data.
As data continues to grow, the need for advanced data engineering architectures that combine data lakes,
cloud and streaming becomes critical. This data is also time-bound: data is created in real time and its
value diminishes over time. Organizations need to take immediate action on their data when it is fresh, or
else they will lose out on business opportunities.
What Is Change Data Capture?
Data with timely value comes from various sources such as log files, machine logs, IoT devices, weblogs,
social media, etc. To ensure you don’t miss the opportunities for real-time insights, it’s essential to have a
means to rapidly capture data changes and updates from transactional data sources. Change data capture
(CDC) is a design pattern that allows users to detect changes at the data source and then apply them
throughout the enterprise.
In the case of relational transactional databases, CDC technology helps customers capture the changes in
a system as they happen and propagate the changes onto analytical systems for real-time processing and
analytics. For example, say you have two databases (source and target) and you update a data point in the
source database. Now, you would like to have the same change to be reflected in the target database. With
CDC, you can collect transactional data manipulation language (DML) and data definition language (DDL)
instructions (for example, insert, update, delete, create, modify, etc.) to keep target systems in sync with the
source system by replication of these operations in near real time.
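To make the pattern concrete, here is a minimal, purely illustrative Python sketch: a change event carries the operation type, the row key and the new row values, and an apply function replays it against a target store. All names are hypothetical; real CDC tools emit richer, tool-specific events.

```python
# Minimal illustration of the CDC pattern: change events captured at the
# source are replayed against a target store to keep it in sync.
# All names here are hypothetical; real CDC tools emit richer events.

from typing import Any

def apply_change(target: dict[str, dict[str, Any]], event: dict[str, Any]) -> None:
    """Replay one DML change event against an in-memory 'target table'."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]          # upsert keeps the target in sync
    elif op == "delete":
        target.pop(key, None)               # remove the deleted row

target_table: dict[str, dict[str, Any]] = {}
events = [
    {"op": "insert", "key": "c1", "row": {"name": "Ann", "phone": "555-0100"}},
    {"op": "update", "key": "c1", "row": {"name": "Ann", "phone": "555-0199"}},
]
for e in events:
    apply_change(target_table, e)
print(target_table)   # {'c1': {'name': 'Ann', 'phone': '555-0199'}}
```

Replaying events in commit order is what keeps the target an exact, near-real-time copy of the source.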
CDC is used for continuous, incremental data synchronization and loading. CDC can keep multiple systems
in sync as well as monitor the source data for schema changes. CDC can dynamically modify the data
stream to accommodate schema changes, which is important for different types of data coming in from
live data sources. CDC continuously captures real-time changes in data in seconds or minutes. The data
is then ingested into target systems such as cloud data warehouses and data lakes or cloud messaging
systems, helping organizations to develop actionable data for advanced analytics and AI/ML use cases.
Change Data Capture Use Cases
CDC is used for various use cases such as synchronization of databases (a traditional use case) and
real-time streaming analytics and cloud data lake ingestion (more modern use cases).
Figure 2. Examples of CDC use cases.
Real-Time Database Synchronization
Imagine you have an online system that is continuously updating your application database. Let’s say you
have a customer who is registering on your web application. As part of registration, the customer will need
to provide information such as name, age and telephone number. That same record is now created in your
source systems.
Once the record is created in the source system, a change event is emitted. Essentially, the system will submit that change event to some form of a listener, and that listener can then process that change and create a record in the target system. Now, both the source and target systems will have the same customer record, as the data is synchronized. This is an example of when a new record is created.
With CDC, we can also capture incremental changes to the record and schema drift. So, imagine the same
customer comes back and updates some information — like the telephone number. The changed telephone
number gets updated in the source system. CDC will capture this event again and update the record in the
target database in real time. You can also define how to treat the changes (for example, replicate or ignore
them). This is an elementary example of data synchronization using CDC technology.
Reference architectures can become complex in large enterprises, where different teams are working on different sets of technologies and each has its own requirements. When you apply CDC methodology, you will need to modify the application processing mechanism to create a unified solution across all systems. As your data is loaded into different types of discrete systems, you may need to apply in-memory filtering, transformation and other types of operators to the data. Your data is processed and pushed down to the system that is the target for your CDC-based needs. You may want to process your data in batch (this is done mostly the first time you load your data from the source system to your target system). Once you have your initial data loaded, you may want to process the incremental changes in real time, as the sketch below illustrates.
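A hedged sketch of that two-phase pattern, assuming an initial bulk load followed by continuous application of change events; source_rows, change_stream and the function names are hypothetical stand-ins:

```python
# Sketch of the common two-phase CDC pattern: a one-time bulk load of the
# existing data, then continuous application of incremental change events.
# source_rows and change_stream are hypothetical stand-ins for a real
# source table and a real CDC event feed.

def initial_load(source_rows, target):
    # Phase 1: copy the full snapshot once (typically a batch job).
    for row in source_rows:
        target[row["id"]] = row

def apply_incremental(change_stream, target):
    # Phase 2: replay only the changes, in order, as they arrive.
    for event in change_stream:
        if event["op"] == "delete":
            target.pop(event["id"], None)
        else:                      # insert or update
            target[event["id"]] = event["row"]
```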
You can override the schema drift options when you resume a database ingestion job that is in the stopped,
aborted, or failed state. The overrides affect only those tables that are currently in the error state because
of the Stop Table or Stop Job Schema Drift option.
CDC also can transmit source schema/DDL updates into message streams and integrate with messaging
schema registries to ensure that analytics consumers understand the metadata.
Figure 3. Synchronization of a traditional database with change data capture.
Modern Real-Time Streaming Analytics and Cloud Data Lake Ingestion
In a modern data architecture, customers can continuously ingest CDC data into a data lake in the form of
an automated data pipeline. CDC can propagate data to message queues to help analyze streaming log
data for any kind of issue. In this case, data is pushed to a messaging queue like Apache Kafka or Amazon
Kinesis as a target for reading and processing data. CDC can also be leveraged for data migration from on-premises to cloud, for data modernization initiatives and for advanced analytics and AI/ML use cases that analyze data to generate real-time insights.
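As a minimal illustration of that hand-off, the sketch below publishes a CDC event to a Kafka topic with the open source kafka-python client; the broker address, topic name and event shape are assumptions, not any specific product's format.

```python
# Illustrative sketch: publishing CDC events to Apache Kafka with the
# open source kafka-python client. Broker address, topic name and the
# event payload shape are assumptions for this example.

import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"op": "update", "table": "orders", "key": "o42",
         "row": {"status": "shipped"}}

# Keying by primary key keeps all changes for a row in one partition,
# preserving per-row ordering for downstream consumers.
producer.send("cdc.orders", key=event["key"].encode(), value=event)
producer.flush()
```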
CDC helps avoid some of the bottlenecks you may encounter when provisioning large amounts of data
from legacy data stores to the new data lake as the data only needs to be loaded once; thereafter, CDC
just provisions the changes to the data lake. In fact, many organizations have favored data lake platforms over ETL platforms, as the data lake environment tends to be less expensive.
The most difcult part of the data lake is maintaining it with current data. CDC can be extremely helpful
there. It can help you save computing and network costs, especially in the case of cloud targets, as you are
not moving terabytes of data unnecessarily across your network. You can focus on the change in the data.
With support for technologies like Apache Spark for real-time processing, CDC is the underlying technology
for driving advanced real-time analytics and AI/ML use cases.
Customers can also take advantage of CDC technology coupled with streaming ingestion to address real-
time analytics use cases. The streaming ingestion scenarios could include IoT data ingestion, clickstream ingestion or log file ingestion. Now, let's dive into two such real-world use cases: real-time fraud detection at a bank and real-time inventory analysis at a retailer.
Real-Time Fraud Detection
A large corporate bank was facing challenges with a sudden increase in fraudulent activities, which resulted
in unhappy customers and loss of business. The bank wanted to build a real-time analytics platform to
proactively alert customers about potential fraud so the customers could take remedial actions. To do
so, the bank needed to ingest transactional information from its database in real time and apply a fraud
detection ML model to identify potentially fraudulent transactions.
The Informatica Intelligent Data Management Cloud™ (IDMC) offers the key capabilities required here: real-time streaming analytics and data replication with versatile connectivity through cloud data ingestion and replication services. As part of IDMC, the Informatica CDC services capture the change data in real time from the transactional database and publish it into Apache Kafka. The Informatica streaming ingestion services read this data and enrich it for real-time fraud analytics, enabling the fraud monitoring tool to proactively send text and email alerts to customers about potential fraud.
As a result, the bank improves customer experience, thereby helping to retain and grow its customer base.
Figure 4. Reference architecture for real-time fraud detection.
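To make the flow concrete, here is a hedged, non-Informatica sketch of the consumer side of such a pipeline: it reads CDC transaction events from a Kafka topic and scores them with a stub fraud model. The topic name, the threshold and score_transaction are placeholders, not Informatica APIs.

```python
# Illustrative consumer-side sketch: read CDC transaction events from a
# Kafka topic and score them with a fraud model. The topic name, the
# score_transaction stub and the alerting step are placeholders; they do
# not represent Informatica's services or APIs.

import json
from kafka import KafkaConsumer  # pip install kafka-python

def score_transaction(txn: dict) -> float:
    """Stub for a trained ML model; returns a fraud probability."""
    return 0.99 if txn.get("amount", 0) > 10_000 else 0.01

consumer = KafkaConsumer(
    "cdc.transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    txn = message.value["row"]
    if score_transaction(txn) > 0.9:
        # In a real pipeline this would trigger a text/email alert.
        print(f"Potential fraud on account {txn.get('account_id')}")
```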
Optimizing Inventory for Operational Readiness
A retail giant that operates several supermarkets and multi-department stores was struggling with poor inventory management. This resulted in a failure to meet demand and a loss of customers. It also called the retailer's operational readiness into question. The retailer needed the point-of-sale data from all the stores across the country in one place in real time to assist with inventory analysis and run the stores efficiently.
The Informatica CDC services capture the change data in real time from IoT devices and feed it to IBM MQ, from where the Informatica streaming ingestion services enrich the data and transfer it to a cloud data lake.
This data is then used for real-time downstream analytics to inform key decisions for managing the inventory.
Figure 5. Reference architecture for real-time inventory analysis.
Benets of Change Data Capture
CDC captures changes from the database transaction log and publishes them to a destination such as a
cloud data lake, cloud data warehouse or message hub. This has several benets for the organization.
1. Enables Faster Decision Making: The biggest advantage of CDC technology is that it fuels data for
real-time analytics. This helps organizations make faster and more accurate decisions in real time by
capitalizing on fast-moving data.
This means nding data, analyzing it and acting on it in real time. Organizations can create hyper-personal,
real-time digital experiences for their customers with real-time analytics. For example, real-time analytics
can enable restaurants to create a personalized menu for individual customers based on historical data,
along with data from mobile or wearable devices to provide customers with the best deals and offers.
To look at another example, an online retailer wants to sell more motorcycle helmets and maximize profits. The retailer detects a pattern in real time: when a user views at least three motorcycle safety products, including at least one helmet, analytics indicate that the customer is interested.
The retailer then displays the most profitable helmets. Now, add a time window parameter to this to find the real-time total sales of motorcycle helmets. The reason to do this is to move the price up and down. The first goal is to sell more motorcycle helmets, and the second goal is to maximize profitability. If sales are trending lower than usual, and this customer is price sensitive, then the retailer dynamically lowers the price.
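That time-window logic might look like the following sketch; the window length, baseline volume and discount are made-up values for illustration.

```python
# Sketch of the time-window logic described above: track helmet sales in
# a sliding window and lower the price when sales trend below a baseline.
# Window length, baseline and prices are illustrative values only.

import time
from collections import deque

WINDOW_SECONDS = 3600          # look at the last hour of sales
BASELINE_UNITS = 50            # "usual" hourly sales volume

sales: deque[tuple[float, int]] = deque()   # (timestamp, quantity)

def record_sale(qty: int) -> None:
    sales.append((time.time(), qty))

def units_in_window() -> int:
    cutoff = time.time() - WINDOW_SECONDS
    while sales and sales[0][0] < cutoff:    # evict expired entries
        sales.popleft()
    return sum(qty for _, qty in sales)

def helmet_price(list_price: float, price_sensitive: bool) -> float:
    # If sales trend below baseline and the customer is price sensitive,
    # dynamically discount to win the sale.
    if price_sensitive and units_in_window() < BASELINE_UNITS:
        return round(list_price * 0.9, 2)
    return list_price
```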
2. Minimizes Impact on Production: When done through a tool, CDC brings many advantages compared to script-based or hand-coded implementations. Since it is time-consuming to move full data sets from the source to the production server, CDC captures incremental updates with minimal source-to-target impact. It can read and consume incremental changes in real time to continuously feed the analytics target without disrupting production databases. This helps to scale efficiently to execute high-volume data transfers to the analytics target.
3. Improves Time to Value and Lowers TCO: CDC enables you to build your offline pipeline faster
without worrying about scripting. It helps data engineers and data architects to focus on important tasks.
It also helps minimize total cost of ownership by removing the dependency on highly skilled users for
these applications.
Methods of Change Data Capture
There are several methods and technologies to achieve CDC, and each has its merit depending on the
use case. Here are the common methods, how they work and their advantages and disadvantages.
Timestamps: The simplest way to implement CDC is to use a timestamp column within a table. This technique depends upon a timestamp field being available in the source table(s) to identify and extract change datasets. At minimum, one timestamp field is required for implementing timestamp-based CDC. In some source systems, there are two timestamp source fields — one to store the time at which the record was created, and another field to store the time at which the record was last changed. The timestamp column should be changed every time there is a change in a row.
Timestamps are the easiest to implement and most widely used CDC technique for extracting incremental data. However, this approach only retrieves rows that have changed since the data was last extracted. There may be issues with the integrity of the data in this method. For instance, if a row in the table has been deleted, there is no DATE_MODIFIED value left for that row, and the deletion will not be captured. This method can also slow production performance by consuming source CPU cycles.
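A minimal sketch of timestamp-based extraction, using SQLite so it is self-contained; the table and column names (including DATE_MODIFIED) are illustrative.

```python
# Minimal timestamp-based CDC sketch using SQLite for self-containment.
# It extracts only rows whose DATE_MODIFIED is newer than the last
# extraction watermark. Table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, DATE_MODIFIED TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ann', '2024-05-01T10:00:00')")
conn.execute("INSERT INTO customers VALUES (2, 'Bob', '2024-05-02T09:30:00')")

last_extracted = "2024-05-01T12:00:00"   # watermark saved by the last run

changed = conn.execute(
    "SELECT id, name, DATE_MODIFIED FROM customers WHERE DATE_MODIFIED > ?",
    (last_extracted,),
).fetchall()
print(changed)   # [(2, 'Bob', '2024-05-02T09:30:00')] -- only Bob changed

# Note: a row deleted from customers simply disappears; this query can
# never observe the deletion, which is the integrity gap described above.
```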
Triggers: Another method for building CDC at the application level is dening triggers and creating your
own change log in shadow tables. Shadow tables may store the entire row to keep track of every single
column change, or they may store only the primary key and operation type (insert, update, or delete).
Using triggers for CDC has the following drawbacks (a minimal trigger sketch follows the list):
• Increases processing overhead
• Slows down source production operations
• Impacts application performance
• Is often not allowed by database administrators
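For illustration only, here is a trigger-based change log using SQLite, which supports triggers out of the box; the table, trigger and shadow-table names are made up.

```python
# Trigger-based CDC sketch using SQLite: an AFTER UPDATE trigger writes
# the primary key and operation type into a shadow table. Table and
# trigger names are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, phone TEXT);
    CREATE TABLE customers_shadow (id INTEGER, op TEXT, changed_at TEXT);

    CREATE TRIGGER customers_update AFTER UPDATE ON customers
    BEGIN
        INSERT INTO customers_shadow VALUES (NEW.id, 'UPDATE', datetime('now'));
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, '555-0100')")
conn.execute("UPDATE customers SET phone = '555-0199' WHERE id = 1")

# The shadow table now holds the change log the trigger produced.
print(conn.execute("SELECT * FROM customers_shadow").fetchall())
# e.g. [(1, 'UPDATE', '2024-05-02 09:30:00')]
```

Because the trigger fires synchronously inside every write, it adds exactly the processing overhead the list above describes.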
Log-Based CDC: Transactional databases store all changes in a transaction log that helps the database
to recover in the event of a crash. With log-based CDC, new database transactions — including inserts, updates and deletes — are read from the source databases' transaction logs. Changes are captured without making application-level changes and without having to scan operational tables, both of which add additional workload and reduce source systems' performance. Log-based CDC is the preferred, fastest and least disruptive CDC method because it requires no additional modifications to existing databases or applications.
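As one concrete, non-Informatica illustration, PostgreSQL exposes its write-ahead log through logical decoding, which the psycopg2 driver can consume; the connection string, slot name and test_decoding plugin below are assumptions for this sketch.

```python
# Illustrative log-based CDC against PostgreSQL via logical decoding,
# using psycopg2's replication support. Connection string, slot name and
# the test_decoding output plugin are assumptions for this sketch.

import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=app user=cdc_user",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the replication slot once (skip this if it already exists).
cur.create_replication_slot("cdc_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="cdc_slot", decode=True)

def consume(msg):
    # Each message is a decoded WAL change (insert/update/delete).
    print(msg.payload)
    # Acknowledge so the server can recycle WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)   # blocks, streaming changes as they commit
```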
Change Data Capture with Informatica Intelligent Data Management Cloud™
Figure 6. Informatica IDMC is the industry's most comprehensive solution for multi-cloud, modern data management. The diagram shows data sources (SaaS apps, real-time/streaming sources and on-premises sources such as mainframes, applications, databases, IoT, machine data and logs) connected through AI-powered metadata intelligence and automation to IDMC services — data catalog, data marketplace, governance/access/privacy, MDM & 360 applications, data quality & observability, API & app integration, and data integration & engineering — serving data consumers from ETL developers and data engineers to citizen integrators, data scientists, data analysts and business users.
Informatica Intelligent Data Management Cloud™ (IDMC) is the industry’s most comprehensive, AI-powered,
end-to-end data management platform. It offers key capabilities required to support real-time streaming
analytics and data replication with versatile connectivity through cloud data ingestion and replication services.
The Informatica Cloud Data Ingestion and Replication service provides database and application ingestion capabilities that help you ingest initial and incremental loads from relational databases and SaaS systems onto cloud data lakes, cloud data warehouses and messaging hubs. In each case, the changes in data are captured in real time. It also offers schema drift capabilities to help customers manage changes in the schema automatically and provides real-time monitoring on ingestion jobs with lifecycle management and alerting capabilities.
The data sources supported include relational databases, such as Oracle, SQL Server, PostgreSQL, MySQL or Db2, and SaaS systems, like Adobe Analytics, Google Analytics, Marketo, Microsoft Dynamics 365, NetSuite, Oracle Fusion Cloud, Salesforce, SAP ECC, SAP S/4HANA, ServiceNow, Workday and Zendesk. The supported targets include cloud data warehouses and data lakes, like Amazon Redshift, Amazon S3, Databricks Delta, Google BigQuery, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Microsoft Azure Synapse Analytics and Snowflake, as well as messaging hubs like Apache Kafka.
Enterprises have data from a variety of sources, such as on-premises files, databases, data warehouses,
streaming data and SaaS systems like ERP and CRM. Data needs to be ingested from all these sources into
a cloud data lake or stream storage like Apache Kafka to be enriched, processed and transformed. Once the data is in the cloud, data lake rules, data quality rules and integration logic are applied to the data. This then lands the data in the cloud data warehouse to make it ready for advanced analytics and AI/ML use cases.
The streaming pipeline feeds data for real-time analytics use cases, such as real-time dashboarding and real-
time reporting. Informatica Cloud Data Ingestion and Replication services support the following use cases:
• Cloud Data Warehouse/Cloud Data Lake Ingestion: Enable data ingestion from a variety of sources — such as data lakes and data warehouses, files, streaming data, IoT data and on-premises database content — into a cloud data warehouse and cloud data lake to keep the source and target in sync.
• Data Warehouse Modernization/Migration: Bulk ingest data from on-premises databases into a cloud data warehouse and continuously ingest CDC data to keep the source and target in sync. This applies to on-premises legacy systems, such as mainframes, and relational databases, such as Oracle, IBM Db2, Microsoft SQL Server and others.
• Accelerate Messaging Journey for Real-Time Analytics: Ingest data from a variety of sources — such as logs, clickstream, social media, IoT and CDC data — into Kafka or other messaging systems for real-time operationalization and reporting use cases.
Informatica Cloud Data Ingestion and Replication services also help customers continuously ingest and
process data from a variety of streaming sources by leveraging open source technologies like Apache
Spark and Apache Kafka. The streaming sources include Amazon Kinesis Streams, AMQP, Azure Event Hubs Kafka, Flat File, Google PubSub, JMS, Kafka, MQTT, OPC UA and REST V2, and the targets include Amazon Kinesis, Amazon S3, Databricks Delta, Flat File, Google Cloud Storage V2, Google PubSub, Google BigQuery, JDBC V2, Kafka, Microsoft Azure Data Lake Storage Gen2 and Microsoft Azure Event Hubs. It also provides out-of-the-box capabilities to parse, filter, enrich, aggregate and cleanse streaming data while helping operationalize machine learning models on streaming data. With these capabilities, customers can
perform real-time analytics on CDC data to address their streaming analytics use cases.
Figure 7. Cloud data warehouse/data lake reference architecture.
Conclusion
Real time is the new measurement for digital success, and real-time data has increased in potential
value. The need for immediate, intelligent responses is now paramount. Technologies like CDC can help
companies achieve digital superiority by accelerating business innovation and gaining competitive advantage.
CDC technology helps businesses make better decisions, increase sales and lower operational costs.
With Informatica solutions for CDC, companies can quickly move and ingest a large volume of their
enterprise data from a variety of sources onto the cloud or on-premises repositories for processing and
reporting — or onto messaging hubs for real-time analytics with out-of-the-box connectivity.
Next Steps
To learn more about Informatica solutions for application, database, file and streaming ingestion,
visit the Cloud Data Ingestion and Replication webpage or read the following resources:
• Informatica Cloud Data Ingestion and Replication Datasheet
• What is Data Ingestion? Learn Key Use Cases, Capabilities and Tools
Worldwide Headquarters
2100 Seaport Blvd,
Redwood City, CA 94063, USA
Phone: 650.385.5000
Fax: 650.385.5500
Toll-free in the US: 1.800.653.3871
informatica.com
linkedin.com/company/informatica
twitter.com/Informatica
About Informatica
Informatica (NYSE: INFA) brings data and AI to life by empowering businesses to realize the transformative power of their most critical assets. When properly unlocked, data becomes a living and trusted resource that is democratized across your organization, turning chaos into clarity. Through the Informatica Intelligent Data Management Cloud™, companies are breathing life into their data to drive bigger ideas, create improved processes, and reduce costs. Powered by CLAIRE®, our AI engine, it's the only cloud dedicated to managing data of any type, pattern, complexity, or workload across any location — all on a single platform.
IN09-3914-0524
© Copyright Informatica LLC 2024. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United
States and other countries. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html.
Other company and product names may be trade names or trademarks of their respective owners. The information in this documentation
is subject to change without notice and provided “AS IS” without warranty of any kind, express or implied.