Overview
SQL Server Change Data Capture or CDC – points to note:
You need sysadmin privileges to turn this functionality on
Web, Express and Standard editions of SQL Server do not support this functionality
You need to enable CDC at database level and at the table level for all tables that need to be tracked
You need to start the SQL Server Agent.
SQL Server Change Data Capture: Pros and Cons Challenges
A limitation that we see with the SQL Server Change Data Capture functionality is that it uses I/O and has a small overhead on server resources; however, it has no impact on the tables themselves. There are no locks, reads, or anything to block or slow down transactions. There are several other mechanisms that support real-time data replication from SQL Server. When the underlying tables have large volumes of DML activity, SQL Server CDC may not be the optimum solution.
It is harder to set up than other mechanisms and may need some on-going intervention by the DBAs for transaction logs for transactional replication.
Benefits
The benefit with SQL Server Change Data Capture is that most replication mechanisms from SQL Server require the tables to have a primary key. CDC will support tables without a primary key and this is the use case it is most useful for, as incremental data and real-time replication can be achieved without primary keys on the tables to be replicated.
The CDC mechanism can be used for Always On Availability Groups on SQL Server as CDC data is replicated to the replicas.
CDC codes
When the feature is enabled on a table, the change table named cdc.<captured_instance>_CT is automatically created in the tracked database. The table contains a row for each insert and delete on the source table, and two rows for each update. The first one is identical to the row before the update, and the second one to the row after the update. To query the table, use the cdc.fn_cdc_get_all_changes and cdc.fn_cdc_get_net_changes functions.
select * from [cdc].[Production_Products_CT]
Delete
Insert
Updated row before the change
Updated row after the change
Incremental load for inserts (no updates or deletes)
Initial data load
I have DB Salome on the client server. CDC is already enabled for database and the table. We have a look on one table called Production.Products, how to set up the incremental data load.
I have set this solution up for a SSIS packed. There are two cycles involved in this proces
Cycle 1: Initial load, Data Warehouse loading.
Cycle 2: Incremental load, Data Warehouse loading.
Of course, there are multiple other actions, transformations included in this process like DQS. I want to keep this very brief, so I am not going to any specific details.
CDC control task
On the top, we have CDC Control task. All these CDC control task will use a connection manager to the database where CDC is enabled. We will use it couple times, set it as a project level connection.
2: Choose Mark initial load start
3: Create a variable that hold CDC_State, click new and will be automatically created.
4: dbo.cdc_states, click new and this table will be automatically created. Go with default setting.
This table is used to maintain the state of the loading process. It is used to determine and maintain the processing range when using the CDC components in SSIS
5. State name: give it any name you like. That name will be displayed in the dbo.cdc_state table. You can call it "Batman"
Data Flow Task:
We are doing the initial data load from the source database. That is the DB where CDC is enabled. We must connect to it and get data into Staging Area or table for further processing. Destination in this case is OLE DB Destination and we are loading the data form Production.Products table into db. STG_Products.
CDC Control Task 1
In this step all the setting stays the same as in the CDC Control Task. The only difference is that now you must use Mart initial load end in the CDC control operation.
And that was it. The Initial data load is set up, we can now load data from the source table into staging table and then from staging table into data warehouse. After the data are pulled off the staging table, staging table gets truncated, it's empty and ready for the incremental data load package. You can use Sequence Container task in case you have more Data Flows or more table to load.
2. Incremental data load
And again, we are using the same setting as in the beginning. The only change is that in the CDC control operation use Get Processing range. We must connect to to database, where CDC is enabled. We use the same variable and table for cdc state.
Data Flow task
CDC Source
The mission is to find CDC Source among other sources and drag it in. Define connection to the source DB (1) and table (2) where CDC is enabled. CDC processing mode contains more options. I want just Net changes because now I only care about inserts.
2. CDC Splitter
Has by default four available outputs. For this scenario I am using just Insert output. Delete output is for deleted rows from the source table and Update Output is for updated rows.
OLE DB Destination is my table in the staging area where I want to load data from source table for further processing. (dbo.STG_Products)
CDC Control Task 1
And again, I am using the same setting as for each other CDC Control Task in this entire project. The only thing that changes is CDC Control Operation: Mark processed range
CDC control operation:
Initial load: Mark initial load start, Mark initial load end
Incremental load: Get processing range, Mark processing range.
I won't be doing too much demonstration for this type of scenario. I have added one more record to the source table, going to run incremental load package and data warehouse loading then we can check the tables.
Incremental load for insert, update, delete
(SCD Type 1: overwriting the history for an attribute)
There is a table Customer.Customers on the source database with CDC enabled. I want to create a SSIS package using SQL server CDC feature to capture inserts, updates and deletes from this one table. All the records will go to the staging area and from there to data warehouse. Staging DB and DW DB are on the same server instance in diffrent databases.
Staging area contains 3 tables that will handle changes in Customer.Customers source table. (CDC_Splitter)
Stg_Customers - INSERTS
STG_UPDATE_Customers - UPDATES
STG_DELETE_Customers - DELETES
SSIS - Initial data loading
CDC Control Task : Mark initial load start
Connection Managers pointing to the source database Salome.
Data Flow Task
Initial load goes form Customer.Customers to dbo.STG_Customers. When this package executes it moves data from source to staging area.
CDC Control Task 1
CDC Control Task 1: Mark initial load end
SSIS - Initial data loading - Data Warehouse
Now we got data in the staging area, and we need to move them into DW.
STG_Customers contains inserts. Updates and deletes will be set up in the next package which is for incremental data loading.
Data Flow Task
This package is for Data Warehouse loading. I will explain Execute SQL Task 1 on the end. Data Flow task contains two tasks. OLE DB Source and Destination. Source is dbo.STG_Customers and destination dbo.DimCustomers. This Data Flow task loads all the data from the staging table into data warehouse table.
Execute SQL Task
When the data are moved from all three staging tables into DimCustomerst table, staging tables get truncated.
Launching the package...
By now data warehouse table DimCustomer is loaded for the first time.
SSIS - Incremental data loading
let' performer couple of modifications on the source table....
2. and while these DML statements are running on the source table we can set up the incremental data load package.
CDC Control Task: Get processing range
Data Flow Task
First task is CDC Source; our source is Customer.Customers table.
CDC processing mode: NET: I want to retrieve just the last change, final modification. If some attribute was updated ten times - I do not care - I want just the final value of that attribute.
CDC Splitter contains four outputs: error, inserted, deleted, updated. I have created three tables and each modification will be caught by separate table.
These are the three tables. All three in staging area waiting for inserts, updates, and deletes from the source table. (STG_Customers could be theoretically DimCustomers table, and all the inserts would go directly into data warehouse. It is up to your how it will be all set up. CDC is very flexible and needs lot of T-SQL modifications)
CDC Control Task 1: CDC control operation: Mark processed range
...let's run the package: Incremental data load
We have 1 delete, 1 update, 1 insert.
There are some new data in the staging tables. You can clearly see how stupid this solution is. One table needs another 3 tables to catch its changes. Imagine you have 100 tables like this. Staging and DW databases resign in the same SQL Server instance. You can use stored procedure to move the data from one place to another. You can load deleted rows into separate table in DW and keep the historical changes. You do not have to delete them. But I will...
You can see the data from DW and Staging Area on this picture. I am building a solution where all the inserted rows will go into DW. If there were updates on the source, DW will be updated as well. If rows were deleted on the source, matching rows on DW will be deleted.
I will be using an Execute SQL Task on the DW loading package to perform some DML actions.
Update statement will update DimCustomers table with data from STG_UPDATE_Customers. This table holds the updated rows from the source table.
Delete statement will delete all the deleted rows in Customer.Customers from DimCustomers.
All the inserts in the source table will be loaded into DW.
Loading data form Staging Arean into Data Warehouse
Execute SQL Task 1
Data Flow Task
Execute SQL Task will truncate all the staging tables
After DW have been loaded, the data looks as follow:
DimCustomers table looks exactly as Customer.Customers source table. All staging tables was truncated.
Opmerkingen