White Paper

How Change Data Capture Speeds Decision-Making and Lowers Operational Costs

Support Streaming Analytics and AI/ML Use Cases in Real Time
Contents

The Evolution of Modern Data Architectures – Key Trends and Drivers
The Need for Real-Time Change Data Capture
Change Data Capture Use Cases
Benefits of Change Data Capture
Methods of Change Data Capture
Change Data Capture with Informatica Intelligent Data Management Cloud™
Conclusion
Next Steps
About Informatica
The Evolution of Modern Data Architectures – Key Trends and Drivers
Data is at the core of how modern enterprises run their businesses and is a crucial enabler in driving
digital transformation. Digital transformation has never been more critical than it is today, as the pace
of disruption is only accelerating. According to a recent study from Innosight,¹ the 30- to 35-year average tenure of S&P 500 companies in the late 1970s is forecast to shrink to 15 to 20 years this decade.
Organizations are trying to become data-centric, but the traditional approaches don’t scale and don’t
provide insights that are required to drive innovation. Over time, enterprises accumulate terabytes and
petabytes of data stored in on-premises databases, ERP and CRM systems. They collect the data, run
ETL jobs and ingest data into a data warehouse such as Teradata, SQL Server or an Oracle warehouse.
And when the data increases, they add more data warehouse appliances. The challenge with this approach
is that it creates data silos. As a result, organizations are unable to create end-to-end, 360-degree views of
their customers, markets and products.
With a modern data architecture, organizations can take advantage of exponential data growth and gain
the benets of end-to-end analytics insights. Migrating from a legacy on-premises data warehouse to a
the benefits of end-to-end analytics insights. Migrating from a legacy on-premises data warehouse to a cloud data warehouse and cloud data lake provides benefits such as performance, availability, cost,
manageability and flexibility without compromising security.
Data architecture is going through three fundamental shifts that are disrupting traditional methods of
handling, analyzing and structuring data.
1. Data Warehouse to Data Lake/Lakehouse
A data lake is a strong complement to a data warehouse. And many enterprises are now adopting a new
combined architecture, the “lakehouse.” A lakehouse merges data warehouses and data lakes in one data
platform. A lakehouse brings the best of both worlds together by combining technologies for business
analytics and decision-making with those for exploratory analytics and data science.
The data lake provides cost-effective processing and storage, which is distributable, highly available and
can store data without applying a schema to it. Instead, the schema can be applied later to read the data
for analytics consumption. You can store many different data types: structured, unstructured or semi-
structured. Data lakes are critical for organizations that want to be innovative and intend to address
articial intelligence (AI) and machine learning (ML) use cases.
¹ Innosight, 2021 Corporate Longevity Report, 2021
2. Batch Processing to Stream Processing
While there will always be a place for batch processing, there is a notable increase in the demand for
streaming content; the need for capture and analysis of real-time data increases as the value of time-sensitive data increases. With the adoption of Kappa architecture and other streaming-first architectural patterns, stream processing has become mainstream.² Real-time processing of customer data can create new revenue opportunities, and tracking and analyzing IoT sensor data can improve operational efficiency.
Batch processing can also be combined with stream processing to enrich the content even more.
Whether it is for strategic decisions or a moment-based decision, stream processing enables organizations
to make accurate and faster decisions based on fresh data. For example, stream processing enables you to
identify cross-sell opportunities when a customer walks into a store. Real-time stream processing helps to capture the customer's location and integrate location data in real time with historical insights from batch data to provide the correct in-moment cross-sell opportunity.
3. On-Premises to Cloud
Cloud has become mainstream as security concerns about the technology have abated in most industries.
Resource elasticity and cost advantages have made cloud a significant component of multi-datacenter architectures. According to a 2022 Flexera report, enterprises are running 49% of workloads and storing 46% of data in a public cloud.³
These technology trends enable enterprises to realize benefits such as agility, flexibility and efficiency, as
well as innovation. Businesses can now get better insights from their data and offer the right opportunities
to the right individuals with a seamless experience. These fundamental shifts in data architecture are
opening up new use cases that were not possible with traditional data management approaches. This is
especially true of real-time streaming analytics use cases in the cloud. The Venn diagram below shows the
overlapping use cases for data lakes, streaming and cloud.
² Informatica, "Kappa Architecture – Easy Adoption with Informatica Streaming Data Management Solution"
³ Flexera, State of the Cloud Report, 2022
Figure 1. A Venn diagram showing the overlapping use cases for data lakes, streaming and cloud: cloud data ingestion and replication, streaming analytics, and AI/ML. Sources feeding these include files, relational systems, legacy systems and data warehouses.
The Need for Real-Time Change Data Capture
Today, just about every industry — healthcare, retail, telco, banking, etc. — is being transformed by data.
As data continues to grow, the need for advanced data engineering architectures that combine data lakes,
cloud and streaming becomes critical. This data is also time-bound: data is created in real time and its
value diminishes over time. Organizations need to take immediate action on their data when it is fresh, or
else they will lose out on business opportunities.
What Is Change Data Capture?
Data with timely value comes from various sources such as log files, machine logs, IoT devices, weblogs,
social media, etc. To ensure you don’t miss the opportunities for real-time insights, it’s essential to have a
means to rapidly capture data changes and updates from transactional data sources. Change data capture
(CDC) is a design pattern that allows users to detect changes at the data source and then apply them
throughout the enterprise.
In the case of relational transactional databases, CDC technology helps customers capture the changes in
a system as they happen and propagate the changes onto analytical systems for real-time processing and
analytics. For example, say you have two databases (source and target) and you update a data point in the
source database. Now, you would like to have the same change to be reflected in the target database. With
CDC, you can collect transactional data manipulation language (DML) and data definition language (DDL)
instructions (for example, insert, update, delete, create, modify, etc.) to keep target systems in sync with the
source system by replication of these operations in near real time.
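To make the pattern concrete, here is a minimal, purely illustrative Python sketch: a change event carries the operation type, the row key and the new row values, and an apply function replays it against a target store. All names are hypothetical; real CDC tools emit richer, tool-specific events.

```python
# Minimal illustration of the CDC pattern: change events captured at the
# source are replayed against a target store to keep it in sync.
# All names here are hypothetical; real CDC tools emit richer events.

from typing import Any

def apply_change(target: dict[str, dict[str, Any]], event: dict[str, Any]) -> None:
    """Replay one DML change event against an in-memory 'target table'."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]          # upsert keeps the target in sync
    elif op == "delete":
        target.pop(key, None)               # remove the deleted row

target_table: dict[str, dict[str, Any]] = {}
events = [
    {"op": "insert", "key": "c1", "row": {"name": "Ann", "phone": "555-0100"}},
    {"op": "update", "key": "c1", "row": {"name": "Ann", "phone": "555-0199"}},
]
for e in events:
    apply_change(target_table, e)
print(target_table)   # {'c1': {'name': 'Ann', 'phone': '555-0199'}}
```

Replaying events in commit order is what keeps the target an exact, near-real-time copy of the source.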
CDC is used for continuous, incremental data synchronization and loading. CDC can keep multiple systems
in sync as well as monitor the source data for schema changes. CDC can dynamically modify the data
stream to accommodate schema changes, which is important for different types of data coming in from
live data sources. CDC continuously captures real-time changes in data in seconds or minutes. The data
is then ingested into target systems such as cloud data warehouses and data lakes or cloud messaging
systems, helping organizations to develop actionable data for advanced analytics and AI/ML use cases.
Change Data Capture Use Cases
CDC is used for various use cases such as synchronization of databases (a traditional use case) and
real-time streaming analytics and cloud data lake ingestion (more modern use cases).
Figure 2. Examples of CDC use cases.
Real-Time Database Synchronization
Imagine you have an online system that is continuously updating your application database. Let’s say you
have a customer who is registering on your web application. As part of registration, the customer will need
to provide information such as name, age and telephone number. That same record is now created in your
source systems.
Once the record is created in the source system, a change event is emitted. Essentially, the system will submit that change event to some form of a listener, and that listener can then process that change and create a record in the target system. Now, both the source and target systems will have the same customer record, as the data is synchronized. This is an example of when a new record is created.
With CDC, we can also capture incremental changes to the record and schema drift. So, imagine the same
customer comes back and updates some information — like the telephone number. The changed telephone
number gets updated in the source system. CDC will capture this event again and update the record in the
target database in real time. You can also define how to treat the changes (for example, replicate or ignore
them). This is an elementary example of data synchronization using CDC technology.
Reference architectures can become complex in large enterprises, where different teams are working on different sets of technologies and each has its own requirements. When you apply CDC methodology, you will need to modify the application processing mechanism to create a unified solution across all systems. As your data is loaded into different types of discrete systems, you may need to apply in-memory filtering, transformation and other types of operators to the data. Your data is processed and pushed down to the system that is the target for your CDC-based needs. You may want to process your data in batch (this is done mostly the first time you load your data from the source system to your target system). Once you have your initial data loaded, you may want to process the incremental changes in real time, as the sketch below illustrates.
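A hedged sketch of that two-phase pattern, assuming an initial bulk load followed by continuous application of change events; source_rows, change_stream and the function names are hypothetical stand-ins:

```python
# Sketch of the common two-phase CDC pattern: a one-time bulk load of the
# existing data, then continuous application of incremental change events.
# source_rows and change_stream are hypothetical stand-ins for a real
# source table and a real CDC event feed.

def initial_load(source_rows, target):
    # Phase 1: copy the full snapshot once (typically a batch job).
    for row in source_rows:
        target[row["id"]] = row

def apply_incremental(change_stream, target):
    # Phase 2: replay only the changes, in order, as they arrive.
    for event in change_stream:
        if event["op"] == "delete":
            target.pop(event["id"], None)
        else:                      # insert or update
            target[event["id"]] = event["row"]
```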
You can override the schema drift options when you resume a database ingestion job that is in the stopped,
aborted, or failed state. The overrides affect only those tables that are currently in the error state because
of the Stop Table or Stop Job Schema Drift option.
CDC also can transmit source schema/DDL updates into message streams and integrate with messaging
schema registries to ensure that analytics consumers understand the metadata.
Figure 3. Synchronization of a traditional database with change data capture.
Modern Real-Time Streaming Analytics and Cloud Data Lake Ingestion
In a modern data architecture, customers can continuously ingest CDC data into a data lake in the form of
an automated data pipeline. CDC can propagate data to message queues to help analyze streaming log
data for any kind of issue. In this case, data is pushed to a messaging queue like Apache Kafka or Amazon
Kinesis as a target for reading and processing data. CDC can also be leveraged for data migration from on-premises to cloud, for data modernization initiatives and for advanced analytics and AI/ML use cases that analyze data to generate real-time insights.
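As a minimal illustration of that hand-off, the sketch below publishes a CDC event to a Kafka topic with the open source kafka-python client; the broker address, topic name and event shape are assumptions, not any specific product's format.

```python
# Illustrative sketch: publishing CDC events to Apache Kafka with the
# open source kafka-python client. Broker address, topic name and the
# event payload shape are assumptions for this example.

import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"op": "update", "table": "orders", "key": "o42",
         "row": {"status": "shipped"}}

# Keying by primary key keeps all changes for a row in one partition,
# preserving per-row ordering for downstream consumers.
producer.send("cdc.orders", key=event["key"].encode(), value=event)
producer.flush()
```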
CDC helps avoid some of the bottlenecks you may encounter when provisioning large amounts of data
from legacy data stores to the new data lake as the data only needs to be loaded once; thereafter, CDC
just provisions the changes to the data lake. In fact, many organizations have favored data lake platforms over ETL platforms, as the data lake environment tends to be less expensive.
The most difcult part of the data lake is maintaining it with current data. CDC can be extremely helpful
there. It can help you save computing and network costs, especially in the case of cloud targets, as you are
not moving terabytes of data unnecessarily across your network. You can focus on the change in the data.
With support for technologies like Apache Spark for real-time processing, CDC is the underlying technology
for driving advanced real-time analytics and AI/ML use cases.
Customers can also take advantage of CDC technology coupled with streaming ingestion to address real-
time analytics use cases. The streaming ingestion scenarios could include IoT data ingestion, clickstream ingestion or log file ingestion. Now, let's dive into two such real-world use cases: real-time fraud detection at a bank and real-time inventory analysis at a retailer.
Real-Time Fraud Detection
A large corporate bank was facing challenges with a sudden increase in fraudulent activities, which resulted
in unhappy customers and loss of business. The bank wanted to build a real-time analytics platform to
proactively alert customers about potential fraud so the customers could take remedial actions. To do
so, the bank needed to ingest transactional information from its database in real time and apply a fraud
detection ML model to identify potentially fraudulent transactions.
The Informatica Intelligent Data Management Cloud™ (IDMC) offers the key capabilities required here: real-time streaming analytics and data replication with versatile connectivity through cloud data ingestion and replication services. As part of IDMC, the Informatica CDC services capture the change data in real time from the transactional database and publish it into Apache Kafka. The Informatica streaming ingestion services read this data and enrich it for real-time fraud analytics, enabling the fraud monitoring tool to proactively send text and email alerts to customers about potential fraud.
As a result, the bank improves customer experience, thereby helping to retain and grow its customer base.
Figure 4. Reference architecture for real-time fraud detection.
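To make the flow concrete, here is a hedged, non-Informatica sketch of the consumer side of such a pipeline: it reads CDC transaction events from a Kafka topic and scores them with a stub fraud model. The topic name, the threshold and score_transaction are placeholders, not Informatica APIs.

```python
# Illustrative consumer-side sketch: read CDC transaction events from a
# Kafka topic and score them with a fraud model. The topic name, the
# score_transaction stub and the alerting step are placeholders; they do
# not represent Informatica's services or APIs.

import json
from kafka import KafkaConsumer  # pip install kafka-python

def score_transaction(txn: dict) -> float:
    """Stub for a trained ML model; returns a fraud probability."""
    return 0.99 if txn.get("amount", 0) > 10_000 else 0.01

consumer = KafkaConsumer(
    "cdc.transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    txn = message.value["row"]
    if score_transaction(txn) > 0.9:
        # In a real pipeline this would trigger a text/email alert.
        print(f"Potential fraud on account {txn.get('account_id')}")
```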
Optimizing Inventory for Operational Readiness
A retail giant that operates several supermarkets and multi-department stores was struggling with poor inventory management. This resulted in a failure to meet demand and a loss of customers. It also called the retailer's operational readiness into question. The retailer needed the point-of-sale data from all the stores across the country in one place in real time to assist with inventory analysis and run the stores efficiently.
The Informatica CDC services capture the change data in real time from IoT devices and feed it to IBM MQ, from where the Informatica streaming ingestion services enrich the data and transfer it to a cloud data lake.
This data is then used for real-time downstream analytics to inform key decisions for managing the inventory.
Figure 5. Reference architecture for real-time inventory analysis.
Benets of Change Data Capture
CDC captures changes from the database transaction log and publishes them to a destination such as a
cloud data lake, cloud data warehouse or message hub. This has several benets for the organization.
1. Enables Faster Decision Making: The biggest advantage of CDC technology is that it fuels data for
real-time analytics. This helps organizations make faster and more accurate decisions in real time by
capitalizing on fast-moving data.
This means nding data, analyzing it and acting on it in real time. Organizations can create hyper-personal,
real-time digital experiences for their customers with real-time analytics. For example, real-time analytics
can enable restaurants to create a personalized menu for individual customers based on historical data,
along with data from mobile or wearable devices to provide customers with the best deals and offers.
To look at another example, an online retailer wants to sell more motorcycle helmets and maximize profits. The retailer detects a pattern in real time: when a user views at least three motorcycle safety products, including at least one helmet, analytics indicate that the customer is interested.
The retailer then displays the most profitable helmets. Now, add a time window parameter to this to find the real-time total sales of motorcycle helmets. The reason to do this is to move the price up and down. The first goal is to sell more motorcycle helmets, and the second goal is to maximize profitability. If sales are trending lower than usual, and this customer is price sensitive, then the retailer dynamically lowers the price.
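That time-window logic might look like the following sketch; the window length, baseline volume and discount are made-up values for illustration.

```python
# Sketch of the time-window logic described above: track helmet sales in
# a sliding window and lower the price when sales trend below a baseline.
# Window length, baseline and prices are illustrative values only.

import time
from collections import deque

WINDOW_SECONDS = 3600          # look at the last hour of sales
BASELINE_UNITS = 50            # "usual" hourly sales volume

sales: deque[tuple[float, int]] = deque()   # (timestamp, quantity)

def record_sale(qty: int) -> None:
    sales.append((time.time(), qty))

def units_in_window() -> int:
    cutoff = time.time() - WINDOW_SECONDS
    while sales and sales[0][0] < cutoff:    # evict expired entries
        sales.popleft()
    return sum(qty for _, qty in sales)

def helmet_price(list_price: float, price_sensitive: bool) -> float:
    # If sales trend below baseline and the customer is price sensitive,
    # dynamically discount to win the sale.
    if price_sensitive and units_in_window() < BASELINE_UNITS:
        return round(list_price * 0.9, 2)
    return list_price
```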
2. Minimizes Impact on Production: When done through a tool, CDC brings many advantages compared to script-based or hand-coded implementations. Since it is time-consuming to move full data sets from the source to the production server, CDC captures incremental updates with minimal source-to-target impact. It can read and consume incremental changes in real time to continuously feed the analytics target without disrupting production databases. This helps to scale efficiently to execute high-volume data transfers to the analytics target.
3. Improves Time to Value and Lowers TCO: CDC enables you to build your offline pipeline faster
without worrying about scripting. It helps data engineers and data architects to focus on important tasks.
It also helps minimize total cost of ownership by removing the dependency on highly skilled users for
these applications.
Methods of Change Data Capture
There are several methods and technologies to achieve CDC, and each has its merit depending on the
use case. Here are the common methods, how they work and their advantages and disadvantages.
Timestamps: The simplest way to implement CDC is to use a timestamp column within a table. This technique depends upon a timestamp field being available in the source table(s) to identify and extract change datasets. At minimum, one timestamp field is required for implementing timestamp-based CDC. In some source systems, there are two timestamp source fields — one to store the time at which the record was created, and another field to store the time at which the record was last changed. The timestamp column should be changed every time there is a change in a row.
Timestamps are the easiest to implement and most widely used CDC technique for extracting incremental data. However, this approach only retrieves rows that have changed since the data was last extracted. There may be issues with the integrity of the data in this method. For instance, if a row in the table has been deleted, there is no DATE_MODIFIED value left for that row, and the deletion will not be captured. This method can also slow production performance by consuming source CPU cycles.
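A minimal sketch of timestamp-based extraction, using SQLite so it is self-contained; the table and column names (including DATE_MODIFIED) are illustrative.

```python
# Minimal timestamp-based CDC sketch using SQLite for self-containment.
# It extracts only rows whose DATE_MODIFIED is newer than the last
# extraction watermark. Table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, DATE_MODIFIED TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ann', '2024-05-01T10:00:00')")
conn.execute("INSERT INTO customers VALUES (2, 'Bob', '2024-05-02T09:30:00')")

last_extracted = "2024-05-01T12:00:00"   # watermark saved by the last run

changed = conn.execute(
    "SELECT id, name, DATE_MODIFIED FROM customers WHERE DATE_MODIFIED > ?",
    (last_extracted,),
).fetchall()
print(changed)   # [(2, 'Bob', '2024-05-02T09:30:00')] -- only Bob changed

# Note: a row deleted from customers simply disappears; this query can
# never observe the deletion, which is the integrity gap described above.
```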
Triggers: Another method for building CDC at the application level is dening triggers and creating your
own change log in shadow tables. Shadow tables may store the entire row to keep track of every single
column change, or they may store only the primary key and operation type (insert, update, or delete).
Using triggers for CDC has the following drawbacks (a minimal trigger sketch follows the list):
• Increases processing overhead
• Slows down source production operations
• Impacts application performance
• Is often not allowed by database administrators
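For illustration only, here is a trigger-based change log using SQLite, which supports triggers out of the box; the table, trigger and shadow-table names are made up.

```python
# Trigger-based CDC sketch using SQLite: an AFTER UPDATE trigger writes
# the primary key and operation type into a shadow table. Table and
# trigger names are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, phone TEXT);
    CREATE TABLE customers_shadow (id INTEGER, op TEXT, changed_at TEXT);

    CREATE TRIGGER customers_update AFTER UPDATE ON customers
    BEGIN
        INSERT INTO customers_shadow VALUES (NEW.id, 'UPDATE', datetime('now'));
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, '555-0100')")
conn.execute("UPDATE customers SET phone = '555-0199' WHERE id = 1")

# The shadow table now holds the change log the trigger produced.
print(conn.execute("SELECT * FROM customers_shadow").fetchall())
# e.g. [(1, 'UPDATE', '2024-05-02 09:30:00')]
```

Because the trigger fires synchronously inside every write, it adds exactly the processing overhead the list above describes.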
Log-Based CDC: Transactional databases store all changes in a transaction log that helps the database
to recover in the event of a crash. With log-based CDC, new database transactions — including inserts, updates and deletes — are read from the source databases' transaction logs. Changes are captured without making application-level changes and without having to scan operational tables, both of which add additional workload and reduce source systems' performance. Log-based CDC is the preferred, fastest and least disruptive CDC method because it requires no additional modifications to existing databases or applications.
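As one concrete, non-Informatica illustration, PostgreSQL exposes its write-ahead log through logical decoding, which the psycopg2 driver can consume; the connection string, slot name and test_decoding plugin below are assumptions for this sketch.

```python
# Illustrative log-based CDC against PostgreSQL via logical decoding,
# using psycopg2's replication support. Connection string, slot name and
# the test_decoding output plugin are assumptions for this sketch.

import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=app user=cdc_user",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the replication slot once (skip this if it already exists).
cur.create_replication_slot("cdc_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="cdc_slot", decode=True)

def consume(msg):
    # Each message is a decoded WAL change (insert/update/delete).
    print(msg.payload)
    # Acknowledge so the server can recycle WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)   # blocks, streaming changes as they commit
```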
Change Data Capture with Informatica Intelligent Data Management Cloud™
Figure 6. Informatica IDMC is the industry's most comprehensive solution for multi-cloud, modern data management. The diagram shows data sources (SaaS apps, real-time/streaming sources and on-premises sources such as mainframes, applications, databases, IoT, machine data and logs) connected through AI-powered metadata intelligence and automation to IDMC services — data catalog, data marketplace, governance/access/privacy, MDM & 360 applications, data quality & observability, API & app integration, and data integration & engineering — serving data consumers from ETL developers and data engineers to citizen integrators, data scientists, data analysts and business users.
Informatica Intelligent Data Management Cloud™ (IDMC) is the industry’s most comprehensive, AI-powered,
end-to-end data management platform. It offers key capabilities required to support real-time streaming
analytics and data replication with versatile connectivity through cloud data ingestion and replication services.
The Informatica Cloud Data Ingestion and Replication service provides database and application ingestion capabilities that help you ingest initial and incremental loads from relational databases and SaaS systems onto cloud data lakes, cloud data warehouses and messaging hubs. In each case, the changes in data are captured in real time. It also offers schema drift capabilities to help customers manage changes in the schema automatically and provides real-time monitoring on ingestion jobs with lifecycle management and alerting capabilities.
The data sources supported include relational databases, such as Oracle, SQL Server, PostgreSQL, MySQL or Db2, and SaaS systems, like Adobe Analytics, Google Analytics, Marketo, Microsoft Dynamics 365, NetSuite, Oracle Fusion Cloud, Salesforce, SAP ECC, SAP S/4HANA, ServiceNow, Workday and Zendesk. The supported targets include cloud data warehouses and data lakes, like Amazon Redshift, Amazon S3, Databricks Delta, Google BigQuery, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Microsoft Azure Synapse Analytics and Snowflake, as well as messaging hubs like Apache Kafka.
Enterprises have data from a variety of sources, such as on-premises files, databases, data warehouses,
streaming data and SaaS systems like ERP and CRM. Data needs to be ingested from all these sources into
a cloud data lake or stream storage like Apache Kafka to be enriched, processed and transformed. Once the data is in the cloud, data lake rules, data quality rules and integration logic are applied to the data. This then lands the data in the cloud data warehouse to make it ready for advanced analytics and AI/ML use cases.
The streaming pipeline feeds data for real-time analytics use cases, such as real-time dashboarding and real-
time reporting. Informatica Cloud Data Ingestion and Replication services support the following use cases:
• Cloud Data Warehouse/Cloud Data Lake Ingestion: Enable data ingestion from a variety of sources — such as data lakes and data warehouses, files, streaming data, IoT data and on-premises database content — into a cloud data warehouse and cloud data lake to keep the source and target in sync.
• Data Warehouse Modernization/Migration: Bulk ingest data from on-premises databases into a cloud data warehouse and continuously ingest CDC data to keep the source and target in sync. This applies to on-premises legacy systems, such as mainframes, and relational databases, such as Oracle, IBM Db2, Microsoft SQL Server and others.
• Accelerate Messaging Journey for Real-Time Analytics: Ingest data from a variety of sources — such as logs, clickstream, social media, IoT and CDC data — into Kafka or other messaging systems for real-time operationalization and reporting use cases.
Informatica Cloud Data Ingestion and Replication services also help customers continuously ingest and
process data from a variety of streaming sources by leveraging open source technologies like Apache
Spark and Apache Kafka. The streaming sources include Amazon Kinesis Streams, AMQP, Azure Event Hubs Kafka, Flat File, Google PubSub, JMS, Kafka, MQTT, OPC UA and REST V2, and the targets include Amazon Kinesis, Amazon S3, Databricks Delta, Flat File, Google Cloud Storage V2, Google PubSub, Google BigQuery, JDBC V2, Kafka, Microsoft Azure Data Lake Storage Gen2 and Microsoft Azure Event Hubs. It also provides out-of-the-box capabilities to parse, filter, enrich, aggregate and cleanse streaming data while helping operationalize machine learning models on streaming data. With these capabilities, customers can
perform real-time analytics on CDC data to address their streaming analytics use cases.
Figure 7. Cloud data warehouse/data lake reference architecture.
Conclusion
Real time is the new measurement for digital success, and real-time data has increased in potential
value. The need for immediate, intelligent responses is now paramount. Technologies like CDC can help
companies achieve digital superiority by accelerating business innovation and gaining competitive advantage.
CDC technology helps businesses make better decisions, increase sales and lower operational costs.
With Informatica solutions for CDC, companies can quickly move and ingest a large volume of their
enterprise data from a variety of sources onto the cloud or on-premises repositories for processing and
reporting — or onto messaging hubs for real-time analytics with out-of-the-box connectivity.
Next Steps
To learn more about Informatica solutions for application, database, file and streaming ingestion,
visit the Cloud Data Ingestion and Replication webpage or read the following resources:
• Informatica Cloud Data Ingestion and Replication Datasheet
• What is Data Ingestion? Learn Key Use Cases, Capabilities and Tools
Worldwide Headquarters
2100 Seaport Blvd,
Redwood City, CA 94063, USA
Phone: 650.385.5000
Fax: 650.385.5500
Toll-free in the US: 1.800.653.3871
informatica.com
linkedin.com/company/informatica
twitter.com/Informatica
About Informatica
Informatica (NYSE: INFA) brings data and AI to life by empowering businesses to realize the transformative power of their most critical assets. When properly unlocked, data becomes a living and trusted resource that is democratized across your organization, turning chaos into clarity. Through the Informatica Intelligent Data Management Cloud™, companies are breathing life into their data to drive bigger ideas, create improved processes, and reduce costs. Powered by CLAIRE®, our AI engine, it's the only cloud dedicated to managing data of any type, pattern, complexity, or workload across any location — all on a single platform.
IN09-3914-0524
© Copyright Informatica LLC 2024. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United
States and other countries. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html.
Other company and product names may be trade names or trademarks of their respective owners. The information in this documentation
is subject to change without notice and provided “AS IS” without warranty of any kind, express or implied.