Migrating From Teradata To Azure SQL Data Warehouse
Reviewed by:
Jim Toland – Senior Program Manager
Mitch van Huuksloot – Solution Architect
Jim Rosengarth – Solution Architect
Copyright
This document is provided "as-is". Information and views expressed in this document, including URL and other
Internet Web site references, may change without notice. Some examples depicted herein are provided for
illustration only and are fictitious. No real association or connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any Microsoft product.
You may copy, use, and modify this document for your internal, reference purposes.
Microsoft, Azure, Hyper-V, Internet Explorer, Silverlight, SQL Server, Windows, Windows PowerShell, and Windows
Server are trademarks of the Microsoft group of companies. All other trademarks are property of their respective
owners.
Revision History
Release Date Changes
July 12, 2018 Released the first version of this guide
July 18, 2018 Updated proof of concept approach
July 20, 2018 Corrected flow of document and explained POC workflow
July 23, 2018 Added Executive Summary
August 23, 2018 Removed scripts (to be hosted externally on GitHub) and changed the offline data migration workflow
September 15, 2018 Review by Centre of Excellence Team
September 18, 2018 Flow rewritten and Next Steps section added
October 18, 2018 V1 General Release
Table of Contents
Executive Summary
Migration Phases
    Phase One: Fact Finding
    Phase Two: Defining Success Criteria for a Proof of Concept (Optional)
    Phase Three: Data Layering Mapping Options
    Phase Four: Data Modelling
    Phase Five: Identify Migration Paths
    Bringing it all together in the Execution of Migration
Architectures
    Data Migration
    Modern Data Warehouse
    What is the best approach?
Preparation for Data Migration
    Schema Migration
        Introduction
        Create a Teradata Database Azure VM
        Create Jumpbox VM
        Create an Azure SQL Data Warehouse Database
        Creating the Teradata Schema in Azure
        Use Attunity Replicate to Create Schema
        Extract Schema
Migrate Business Logic
    Metadata Approaches
        Phase One: Lift and Shift
        Phase Two: Rewrite Business Logic
        What is the best approach?
    Example of Business Logic Migration
    Semantic Layer and Reporting Views
        What is the best approach?
Data Migration
    Execution
        Source-controlled pipeline
        Azure Controlled pipeline
        What is the best approach?
        Differentiated or Integrated Export Process
        What is the best approach?
    Suggested Azure Data Factory Pipeline Configuration
        End-to-End Workflow
Next Steps
Executive Summary
This document gives insight into how to approach a Teradata to Azure SQL Data Warehouse migration.
This whitepaper is broken into sections which detail the migration phases, the preparation required for data
migration including schema migration, migration of the business logic, the actual data migration approach and
testing strategy.
The migration should pivot on six areas (one is optional but highly recommended) and, with the benefit of
Azure, you can get ahead of the development curve by quickly being able to provision Azure SQL Data
Warehouses for your development team to start business object migration before the data is migrated.
Using an agile methodology, with a clear focus on what must be delivered to provide a working environment versus the items that are “nice to have”, will give you a solid foundation to build upon within Azure.
The illustration below shows the different areas, each building on top of the previous one:
1. Fact Finding
2. Defining Success Criteria for a Proof of Concept
3. Data Layering
4. Data Modelling
5. Identify Migration Paths
6. Execution of Migration
Through a question-and-answer session you can help define what your inputs and outputs are for the migration project. This is the requirements-collection stage of an Agile methodology.
Taking the answers from phase one, you decide whether to run a POC to validate the required outputs; if so, you run the following phases as a POC. If you have decided on going straight to a live migration, then you would skip the POC stage (not recommended). This should feed into the Agile Project Charter and establish the team norms.
This phase is about mapping the data you have in Teradata to the data layout you will create in Azure SQL Data
Warehouse.
Once you’ve defined the data mappings, phase four concentrates on how to tune Azure SQL Data Warehouse to
provide the best performance for the data you will be landing into it.
Depending on the state of the data warehouse you might need to deploy a team to adjust the schema into a star
or snowflake data model. This layout can bring performance benefits to both the data and reporting layers.
What is the path of least resistance? What is the quickest path given your cloud maturity? Phase Five helps
describe the options open to you and then for you to decide on the path you wish to take. With Azure you have
two different paths – export to Azure Databox and ship to the Azure Datacentre or use Azure ExpressRoute to
transfer from your data centre to Azure’s datacentre.
The data export file types can make a difference to the export, transfer and import to Azure SQL Data
Warehouse – these have been listed in the detailed section.
Migrating your Teradata data to SQL Data Warehouse involves a series of steps. These steps are executed in
three logical stages: Preparation, Metadata migration and Data migration which have been defined in the
previous phases and are combined into Phase Six: Execution of Migration.
Once migrated, you should have a robust testing plan to validate the solution in a test environment before going live in production.
Migration Phases
Phase One: Fact Finding
Phase 1 is all about fact finding and understanding whether the workload is new or a migration project. This is the requirements-collection phase in Agile.
Before migrating, you want to be certain SQL Data Warehouse is the right solution for your workload. SQL
Data Warehouse is a distributed system, designed to perform analytics on large volumes of data. Migrating to
SQL Data Warehouse requires some design changes that are not too hard to understand, but might take some
time to implement. If your business requires an enterprise-class data warehouse (DW), the benefits are worth the
effort. However, if you don't need the power of SQL Data Warehouse, it is more cost-effective to use SQL
Server or Azure SQL Database.
Consider using SQL Data Warehouse when you:
▪ Have one or more Terabytes of data
▪ Plan to run analytics on large amounts of data
▪ Need the ability to scale compute and storage
▪ Want to save costs by pausing compute resources when you don't need them.
Don't use SQL Data Warehouse for operational (OLTP) workloads that have:
▪ High frequency reads and writes
▪ Large numbers of singleton selects
▪ High volumes of single row inserts
▪ Row by row processing needs
▪ Incompatible formats (JSON, XML)
Strategic Questions
▪ Is this a Lift & Shift Migration of On-Premises Data Warehouse or a Green Field Project?
▪ What Data Size Raw Capacity (Storage Level) is required?
▪ What Eco-System Processes are in place to be migrated as part of Data Migration?
▪ What are Data Growth Expectations in 2 – 3 years’ time?
▪ Would it be a 2-Tier Data Warehouse or a Traditional 3-Tier Data Warehouse (Acquisition / Ingestion Layer, Integrated Data / Core Layer and Access / View / Semantic Layer)?
▪ What are the expectations around the following?
− Concurrency (Number of Users at Peak Times and Tactical Workload Queries for ingress and egress of Data alongside in-place Transformations, reference data management and BI Query Workload)
− Scalability – Ties to the concurrency above + Total daily ingestion in bytes
− Availability – Tech. and Business SLA driven
− Resilience – Disaster Recovery and RTOs and RPOs (Off site cross region Active / Passive replicated / synced Database)
− Protection – Data Backup and Archival – How often, what Strategy and what SLAs apply
▪ Are there any thoughts around deploying Semantic Layer via Dimensional or Tabular Data Models
(SSAS?)
▪ How many ETL jobs do you have?
▪ What BI Layer would be used (PowerBI?)
After you have answered the questions above (and any of the more detailed ones that follow), you can then look at sizing the Data Warehouse.
Finally, through the course of the POC you will identify optimisation and fixes which will need to be applied to
the migration. This all drives the cost to migration, but it is important to categorise and show the return on
investment by implementing these changes. Some changes will be mandatory such as identity and networking
infrastructure. These mandatory requirements are usually referred to as Cloud Operating Model.
Consider this as part of your Agile Project Charter and then take the information in this and the previous phase
to build out your agile project. Run two-week sprints with a well-defined backlog. The first sprint would
concentrate on getting the infrastructure ready. From sprint two the team would be focussing on schema and
data migration.
▪ Replicated – stores a copy of the table on each Compute node. Replicated tables are performant because they do not require data movement for joins and aggregations. They do require extra storage, and therefore work best for smaller tables.
▪ Hash distributed – distributes the rows across all the nodes via a hash function. Hash distributed tables are the heart of SQL Data Warehouse since they are designed to provide high query performance on large tables. This option requires some planning to select the best column on which to distribute the data. However, if you don't choose the best column the first time, you can easily re-distribute the data on a different column.
To choose the best distribution option for each table, see Distributed tables.
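As a minimal sketch of how the distribution choice is expressed in table DDL (the table and column names below are illustrative assumptions, not taken from this guide):

-- Large fact table: hash distributed on a high-cardinality join key
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT        NOT NULL,
    CustomerId INT           NOT NULL,
    SaleDate   DATE          NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension table: replicated to every Compute node
CREATE TABLE dbo.DimRegion
(
    RegionId   INT           NOT NULL,
    RegionName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (RegionId)
);

If the first choice of hash column turns out to be poor, the table can be rebuilt on a different column with CREATE TABLE AS SELECT, as shown later in the Next Steps section.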
▪ Source controlled pipeline - data export, transfer and import steps run mostly from the source data system
▪ Azure controlled pipeline - data export, transfer and import steps run mostly from Azure
▪ Differentiated Export - data export, transfer and import are distinctly executed with each reading from or writing to intermediate files
▪ Integrated Export - data export, transfer and import are combined, and entities are transferred directly from the source data system to Azure SQL Data Warehouse with no intermediate files created
Automate as much of the export process as possible; this enables repeatable exports and can be used for both the proof of concept and the live migration.
Use Polybase to import the data to Azure SQL Data Warehouse. PolyBase is a technology that accesses data
outside of the database via the T-SQL language. It is the best way to load data into SQL Data Warehouse. With
PolyBase, the data loads in parallel from the data source directly to the compute nodes.
PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. In addition to the delimited text files,
it loads from the Hadoop file formats RC File, ORC, and Parquet. PolyBase can load data from Gzip and
Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-width format, and nested
formats such as WinZip, JSON, and XML.
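A minimal sketch of a PolyBase load from gzip-compressed, pipe-delimited extract files in Azure Blob Storage is shown below (the storage account, container, schema and table names are illustrative assumptions, and a database master key is assumed to already exist):

-- Credential used by PolyBase to reach the storage account (key shown as a placeholder)
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'blobuser', SECRET = '<storage-account-key>';

-- External data source pointing at the container holding the Teradata extracts
CREATE EXTERNAL DATA SOURCE TeradataExtracts
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://import@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- Pipe-delimited text compressed with gzip
CREATE EXTERNAL FILE FORMAT PipeDelimitedGzip
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = FALSE),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- External table over the export files for one entity
CREATE EXTERNAL TABLE stg.FactSales_ext
(
    SaleId BIGINT, CustomerId INT, SaleDate DATE, Amount DECIMAL(18,2)
)
WITH (LOCATION = '/FactSales/', DATA_SOURCE = TeradataExtracts, FILE_FORMAT = PipeDelimitedGzip);

-- Parallel load into an internal staging table with CREATE TABLE AS SELECT
CREATE TABLE stg.FactSales
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stg.FactSales_ext;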
After collecting information, combining it and making the key decisions in phases one to five, you must move onto the Execution of the Migration. Migrating your Teradata data to SQL Data Warehouse is executed in three logical stages: Preparation, Metadata migration and Data migration.
Logical progression: Preparation, then Metadata Migration, then Data Migration.
In each stage, tasks to be executed involve the on-premises database system, the on-premises local storage, the
network connecting the local system to Azure (either internet or a dedicated circuit) and SQL Data Warehouse.
This results in a physical data movement from the source database to Azure as shown below. (These steps are similar when moving data to SQL Data Warehouse from any other source system, whether hosted in the cloud or on-premises.)
Physical movement: source database > local storage > network (internet or ExpressRoute) > SQL Data Warehouse.
Bringing these two diagrams together results in a logical flow from top to bottom and a physical flow from left
to right.
The combined flow, with the copy to Azure running over the internet or ExpressRoute, covers the following steps. For metadata: export, compatibility changes, copy to Azure and import. For data: filter and extract, compatibility changes, conversions, compression, copy to Azure, load, transformation and optimizations.
If the volume of the data to migrate is large, some steps can be time consuming. These steps are rate-
determining because they influence the overall migration time. Such steps are shaded in yellow.
Some migration steps may be optional depending on the size of the data, the nature of the network, and the tools
and services used for migration. Optional steps are shown with orange dots.
Architectures
Data Migration
To ingest historical data, you need a basic cloud Data Warehouse setup for moving data from your on-premises solution to Azure SQL Data Warehouse and to enable the development team to build Azure Analysis Services cubes once the majority of the data is loaded.
1. Azure Data Factory Pipeline is used to ingest and move data through the store, prep and train pipeline.
2. Extract and load files, via PolyBase, into the Staging schema on Azure SQL DW.
3. Transform data through the Staging, Source (ODS), EDW and Semantic schemas on Azure SQL DW.
4. Azure Analysis Services will be used as the semantic layer to serve thousands of end users and scale out Azure SQL DW concurrency.
5. Build operational reports and analytical dashboards on top of Azure Analysis Services to serve thousands of
end users via Power BI.
Note: You should provision a testing virtual machine in Azure, with accelerated
networking enabled. This will be used to execute ad-hoc test queries. You can then look at
the internals of Azure SQL DW to see the latency/execution times.
This gives you the ability to easily add more data feeds, which are then moved through the pipeline up to the Azure SQL Data Warehouse layer and on to your analytics/reporting layer.
1. Combine all your structured, unstructured and semi-structured data (logs, files, and media) using Azure Data Factory to Azure Blob Storage.
2. Leverage data in Azure Blob Storage to perform scalable analytics with Azure Databricks and achieve cleansed and transformed data.
3. Cleansed and transformed data can be moved to Azure SQL Data Warehouse to combine with existing structured data, creating one hub for all your data. Leverage native connectors between Azure Databricks and Azure SQL Data Warehouse to access and move data at scale.
4. Build operational reports and analytical dashboards on top of Azure Data Warehouse to derive insights from the data and use Azure Analysis Services to serve thousands of end users. Run ad hoc queries directly on data within Azure Databricks.
Once the team starts to work on incremental loads, the Modern Data Warehouse architecture should have been built, with a focus on Azure Databricks to cleanse and transform data into Azure SQL Data Warehouse.
See further details on Data Migration pipeline to understand how this architecture supports your migration.
By provisioning an environment in Azure in under an hour, both activities can run in parallel. The key to getting the schema from Teradata to SQL Data Warehouse is to build a Teradata Database VM on which we can recreate the on-premises schema.
The process of extracting the Teradata tables to Azure SQL Data Warehouse is:
1. Create a Teradata Database in Azure using Azure Marketplace
2. Create a Jumpbox Virtual Machine and configure it
3. Create an Azure SQL Data Warehouse Database
4. Extract Schema
If you are able to provision a virtual machine quickly in your datacenter, close to the Teradata database, then this is the preferred process and would allow you to start at “Create Jumpbox VM”. However, in our experience, internal processes can slow this down and, while you wait for the schema, the project can be on hold for a long time.
Using Azure Marketplace, it is possible to deploy a 2-node development edition of Teradata Database 15.x and then use it to create a copy of the schemas you wish to migrate. Only the Teradata Database is required; no other Teradata components are needed.
Pre-requisites:
You should have extracted a copy of the Teradata Schema scripts for the objects you wish to migrate. You will
need one vNet with three subnets provisioned and an Azure Subscription with contributor permission on at least
one empty Azure resource group.
After the Teradata cluster has been created, you will need to open port 1025 to allow the Jumpbox VM to
communicate with it.
Full details on creating a Teradata Azure VM can be found in the Appendix under “Create Teradata Database
Azure VM”.
Create Jumpbox VM
To facilitate copying the schema between the Azure Teradata Cluster and Azure SQL Data Warehouse you
should create a Windows 2012 Azure VM (Size D4_v2). Using the Data Science Virtual Machine - Windows
Server 2012 edition, from the Azure Marketplace, will save you time in installing SQL Server Management
Studio and Visual Studio – since these components are pre-installed on the Azure Marketplace image.
Additional software should be installed after the VM has been provisioned (this also applies to an on-premises virtual machine). Full details on creating the Jumpbox VM, including this additional software, can be found in the Appendix under “Create Jumpbox VM”.
Prior to using Attunity Replicate, to create the schema, you will need to create a master key for Azure SQL Data
Warehouse.
To run loads with appropriate compute resources, we recommend that you create a dedicated loading user assigned to a resource class that fits the load (see Loading Best Practices).
Full details on creating an Azure SQL Data Warehouse Database can be found in the Appendix under “Create a SQL Data Warehouse Dev DB”.
By approaching the schema migration using a separate instance away from Production, you can remove several
barriers around connecting to a production instance.
Full details on how you can create the Teradata schema can be found in the Appendix under ”Create the
Teradata Schema in Azure”.
Attunity Replicate for Microsoft Migrations is installed on the Jumpbox VM and is used to transfer the Teradata
schema to Azure SQL Data Warehouse.
The table schema will need to be performance tuned as each table is created as a ROUND ROBIN table.
Once migrated, you should check that the schema has been transferred. The full process is described in the
Appendix under “Using Attunity Replicate to Create Schema”.
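As a quick check that every table has arrived, and to see which tables are still round robin, a query such as the following can be run against the system views (a sketch, not specific to any one schema):

SELECT s.name AS schema_name,
       t.name AS table_name,
       dp.distribution_policy_desc
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.pdw_table_distribution_properties dp ON t.object_id = dp.object_id
ORDER BY s.name, t.name;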
Extract Schema
At this point in the schema creation process, it is advisable to extract the Azure SQL Data Warehouse schema
and check it into your code store.
The easiest way to achieve this is with the mssql-scripter. mssql-scripter is the multiplatform command line
equivalent of the widely used Generate Scripts Wizard experience in SSMS.
You can use mssql-scripter on Linux, macOS, and Windows to generate data definition language (DDL) and
data manipulation language (DML) T-SQL scripts for database objects in SQL Server running anywhere, Azure
SQL Database, and Azure SQL Data Warehouse. You can save the generated T-SQL script to a .sql file or pipe
it to standard *nix utilities (for example, sed, awk, grep) for further transformations. You can edit the generated
script or check it into source control and subsequently execute the script in your existing SQL database
deployment processes and DevOps pipelines with standard multiplatform SQL command line tools such as
sqlcmd.
This will extract the Azure SQL Data Warehouse Schema to allow you to recreate the database later. For a full
command line reference please see https://github.com/Microsoft/mssql-scripter/blob/dev/doc/usage_guide.md.
Some Azure SQL Data Warehouse features can significantly improve performance as they are designed to work
in a distributed fashion. However, to maintain performance and scale, some features are also not available. See
Migrate your SQL code to SQL Data Warehouse.
Metadata Approaches
There are two approaches which you can take to migrating the stored procedures and views which reside in the
Teradata Database.
However, if you are migrating to different application stacks and have the timelines to accommodate business logic migration alongside the data loads, then investing upfront in this work will leave you with less technical debt.
Looking at this code there are multiple challenges with trying to execute this against SQL Data Warehouse. To
convert these into T-SQL we would apply the following changes:
These are small changes and show why our approach of Datometry followed by recoding eases the migration path to SQL Data Warehouse.
The finished T-SQL script, with changes highlighted in red, would look like:
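As an illustration only (the tables, columns and logic below are hypothetical and are not taken from the script referenced above), a typical conversion of this kind replaces Teradata's SEL abbreviation and QUALIFY clause with standard T-SQL:

-- Teradata original (hypothetical)
-- SEL CustomerId, SaleDate, Amount
-- FROM Sales.FactSales
-- QUALIFY ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SaleDate DESC) = 1;

-- Equivalent T-SQL for Azure SQL Data Warehouse
SELECT CustomerId, SaleDate, Amount
FROM (
    SELECT CustomerId, SaleDate, Amount,
           ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SaleDate DESC) AS rn
    FROM Sales.FactSales
) AS ranked
WHERE rn = 1;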
Several great benefits exist when leveraging Azure Analysis Services (AAS) with SQL DW:
▪ Use Cached mode for maximum query performance, where you have complex measures and a fully cached model, and where you want to be able to pause or scale Azure SQL Data Warehouse up and down around loads. The challenges with this mode are reduced data freshness, the requirement that data fits into memory, and that it is read only (no writeback).
▪ Use Direct Query for sourcing data directly, optimal data freshness and querying at the base fact level. The challenges are inconsistent query performance, consumption of concurrency slots, no caching, the requirement that Azure SQL Data Warehouse is online, and a potentially higher DWU level.
Data Migration
There are multiple ways to approach data migration and your choice will depend on the connectivity you have from
your datacenter to Azure.
Execution
Source-controlled pipeline
▪ Data export, transfer and import steps run mostly from the source data system
▪ Reuse existing compute and storage resources at the source system
▪ May use custom scripts and programs or ETL tools like SSIS run from the source database server
Azure Controlled pipeline
Adopting the Azure controlled pipeline approach, orchestrated all the way from Azure down to the on-premises resources, removes human intervention and reduces the number of mistakes. It relies less on skills and more on the infrastructure you have enabled.
Teradata ships with the FastExport utility, which allows you to export data and which is invoked through Teradata Parallel Transporter. The Azure Data Factory v2 Teradata connector calls Teradata Parallel Transporter.
An internet connection to the virtual machine is required to allow Azure Data Factory to connect to self-hosted
runtimes which will orchestrate the export/import process.
If your data export is between 1TB and 2TB at a time, having at least two virtual machines, with self-hosted
runtimes, will increase throughput by 1.5x.
The following items are required on the Virtual Machine located in the Teradata Datacenter: -
▪ Windows Server 64bit 2012 R2, 8 cores and minimum 256GB RAM.
▪ Enough disk space to export your data to. This depends on how much whitespace you have in your
database tables. It is possible to compress a 1.2TB extract to 100GB using pipe delimited text, with
GZIP, or ORC formats if most columns are NULL or have repeated values.
▪ Teradata Tools and Utilities 15.1 (https://downloads.teradata.com/download/tools/teradata-tools-and-
utilities-windows-installation-package).
▪ Configure Virtual Machine as per the instructions on Configuring Data Science VM earlier in this
document.
▪ PowerShell with the Azure PowerShell module 3.6 or later (see install documentation at
https://docs.microsoft.com/en-us/powershell/azure/install-azurerm-ps?view=azurermps-6.8.1)
▪ SQL Server Management Studio (https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-
management-studio-ssms?view=sql-server-2017)
▪ Microsoft Connectors for Oracle and Teradata by Attunity for Integration Services (SSIS)
(https://docs.microsoft.com/en-us/sql/integration-services/attunity-connectors?view=sql-server-2017)
For the integrated history load process, we would suggest you create a Windows 2012 R2+ Virtual Machine located as near as possible to the existing Teradata Cluster and follow the instructions on Configuring Jumpbox VM earlier in this document.
You should then follow the Attunity Replicate for Microsoft Migrations instructions for migrating data which
are maintained outside this document.
The integrated approach relies on Azure ExpressRoute being in place between the source Teradata location
datacenter and Azure datacenter location, then using Attunity Replicate for Microsoft Migrations to ease the
historical load process.
Attunity Replicate for Microsoft Migrations can be scaled out onto multiple nodes and use one node to co-
ordinate the load process. It is recommended that you build VMs, based on their recommended specifications, in
the Teradata source location datacenter. Attunity will query data locally and then upload to blob storage before
creating external tables and importing data using a Create Table as Select to the base tables.
Differentiated Export
▪ Data export, transfer and import are distinctly executed with each reading from or writing to intermediate files
▪ File compression is used to reduce the cost and time in transferring files from the source system to Azure
▪ Compressed files are transferred to Azure-based storage before import
▪ When the connectivity from source to Azure is lower in bandwidth or reliability, this approach may turn out to be more feasible
▪ Typically realized by custom programs and scripts working independently, using tools such as bcp.exe, AzCopy, and compression libraries
The data is transferred to local storage which can then be copied to Azure Data Box disks and shipped to the Azure datacenter. Full details on Azure Data Box are available at https://docs.microsoft.com/en-us/azure/databox/data-box-disk-overview.
The time to transfer data over the internet or Azure ExpressRoute will drive the decision to use Azure Data Factory or AzCopy to upload the export files instead of the Azure Data Box service.
For example, a 100TB export over a 70 Mbps connection, assuming you have 100% of the bandwidth all of the time, would take almost 140 days to upload. Anything that would take more than 14 days to upload should make use of the Azure Data Box service.
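As a rough check on that figure: 100 TB is roughly 8 x 10^14 bits; at a sustained 70 Mbps that is about 1.1 x 10^7 seconds, or roughly 130 days, before allowing for any protocol overhead or contention.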
To summarize, the process flow of the two methods described would be broken into the following steps, with both methods beginning with an export from Teradata to local disk using Azure Data Factory V2:
▪ For large databases following the differentiated export approach with Azure Data Factory, export tables to pipe-delimited text with gzip, or to ORC, from Teradata – AzCopy to Azure Blob Storage or Azure Data Lake Store. Use Azure Data Factory to PolyBase the data into Azure SQL DW.
▪ To force PolyBase instead of BULKCOPY, in Azure Data Factory, make sure:
− The input dataset type is AzureBlob or AzureDataLakeStoreFile. The format type under type
properties is OrcFormat, or TextFormat (Pipe Delimited Text is fastest), with the following
configurations:
▪ fileName doesn't contain wildcard filter.
▪ rowDelimiter must be \n.
▪ nullValue is either set to empty string ("") or left as default, and treatEmptyAsNull is not set to
false.
If you have no Azure ExpressRoute or less than a 70 Mbps VPN to Azure, then:
▪ Using Azure Data Factory, export tables to pipe-delimited text or ORC from Teradata and use Data Box to transfer to the Azure datacenter – PolyBase into Azure SQL DW.
If your POC doesn’t have Azure ExpressRoute or you need to load data in stages, then you will need to look at
alternative ways to load data from Teradata to SQL Data Warehouse, using the differentiated approach and your
favorite ETL toolkit.
All configuration activities are executed from the Virtual Machine located nearest to the Teradata Cluster.
End-to-End Workflow
▪ The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Then, the pipeline triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
▪ The second pipeline performs the actual copy. It takes the list of tables as a parameter and, for each table in the list, copies the specific table from Teradata to local storage.
Note: Depending on bandwidth, you would either copy the local storage to Azure Data Box disks and ship them to the Azure datacenter, or use AzCopy with multiple threads (no more than the number of processor cores on the VM) to copy to blob storage.
▪ The third pipeline looks up the list of tables that need to be copied over to the sink data stores. Then, the pipeline triggers another pipeline, which iterates over each table in the database and performs the data copy operation.
▪ The fourth pipeline performs the actual copy. It takes the list of tables as a parameter and, for each table in the list, copies the specific table from blob storage to Azure SQL Data Warehouse.
You could keep these metadata tables simple, but keep in mind that if you have some very large tables, you will want to split them into smaller export files, as opposed to creating one large export file, which can cause out-of-memory issues during export and is much harder to reload should the import fail at some point.
During our testing, we always ran a smaller 1.5TB test export a couple of times, to make sure that the multiple
file sizes for one table were less than 2GB.
The simple metadata table, called TablesToProcess, has schemaname and tablename columns. If you wanted to
add intelligence to this, you could add an imported smallint column to allow ADF to update this column to 1
upon successful import or 3 upon failure.
A more advanced metadata table structure would still need one table called TablesToProcess which has
schemaname, tablename, filtercol, FilterVal and imported columns.
In the advanced metadata table, you will see multiple entries for the same table. As Azure Data Factory accepts
parameters, we can build dynamic SQL where clauses. The filtercol contains the column we will filter on and
the filtervalue is the value we will compare to.
When the pipeline runs, it will use the advanced metadata table to export data into smaller files. The import will
use the same table to move the data into Azure SQL Data Warehouse.
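A minimal sketch of the advanced form of this table, shown as Azure SQL Data Warehouse DDL (the column types and distribution are assumptions; the simple form is just the first two columns):

-- One row per export file; filtercol/FilterVal drive the dynamic WHERE clause,
-- and imported is updated by the pipeline (for example 1 = success, 3 = failure)
CREATE TABLE dbo.TablesToProcess
(
    schemaname NVARCHAR(128) NOT NULL,
    tablename  NVARCHAR(128) NOT NULL,
    filtercol  NVARCHAR(128) NULL,
    FilterVal  NVARCHAR(256) NULL,
    imported   SMALLINT      NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);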
To best understand this pipeline we would suggest running through the detailed setup in the Appendix.
Next Steps
This section details the next areas you need to put into your migration plan once the schema and data have been moved to Azure SQL Data Warehouse.
Data Distribution
Data Distribution is an important concept in SQL Data Warehouse. It allows breaking down data processing
requirements for efficient parallel processing across all hardware nodes. When creating any table in SQL DW,
the distribution key is used to determine the node on which each row of data is stored. That is, any table will be
distributed across multiple nodes based upon the distribution key. An important factor in SQL DW database
design is selecting good distribution keys for fact tables.
The most expensive queries in MPP systems such as SQL DW, are those that involve a lot of data movement.
When you execute a query, SQL DW will redistribute rows across the compute nodes as necessary to fulfill the
query such as on JOINs and GROUP BY statements. Users should create tables with distribution styles that
minimize data movement, reflective of queries that would occur in their normal workload.
▪ Hash Distribution – Based on the result of a hashing algorithm on the chosen key, SQL Data
Warehouse will place the row in one of sixty distributions.
▪ Round Robin – SQL Data Warehouse will distribute each incoming row in a round robin fashion
▪ Replicated – SQL Data Warehouse will replicate the table on each of the compute nodes, backed by a
round robin table for insert performance.
Once data has been loaded, you should redistribute tables to a more appropriate distribution type.
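A minimal sketch of such a redistribution using CREATE TABLE AS SELECT (table and column names are illustrative):

-- Rebuild a round-robin table as hash distributed, then swap the names
CREATE TABLE dbo.FactSales_new
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

RENAME OBJECT dbo.FactSales TO FactSales_old;
RENAME OBJECT dbo.FactSales_new TO FactSales;
DROP TABLE dbo.FactSales_old;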
Indexing
SQL Data Warehouse offers several indexing options including clustered columnstore indexes, clustered
indexes and nonclustered indexes, and a non-index option also known as heap.
Clustered columnstore tables offer both the highest level of data compression as well as the best overall query
performance. Clustered columnstore tables will generally outperform clustered index or heap tables and are
usually the best choice for large tables. Clustered columnstore is the best place to start when you are unsure of
how to index your table.
There are a few scenarios where clustered columnstore may not be a good option:
▪ Columnstore tables do not support varchar(max), nvarchar(max) and varbinary(max). Consider heap or
clustered index instead.
▪ Columnstore tables may be less efficient for transient data. Consider heap and perhaps even temporary
tables.
▪ Small tables with less than 100 million rows. Consider heap tables.
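A minimal sketch of the three storage options expressed as table DDL (the tables below are hypothetical), as referenced in the list above:

-- Transient staging data: heap
CREATE TABLE stg.SalesLoad (SaleId BIGINT, Amount DECIMAL(18,2))
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);

-- Small table that benefits from range filtering: clustered index
CREATE TABLE dbo.DimDate (DateKey INT NOT NULL, CalendarDate DATE NOT NULL)
WITH (CLUSTERED INDEX (DateKey), DISTRIBUTION = REPLICATE);

-- Large fact table: clustered columnstore, the default starting point
CREATE TABLE dbo.FactSalesLarge (SaleId BIGINT NOT NULL, CustomerId INT NOT NULL, Amount DECIMAL(18,2) NOT NULL)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(CustomerId));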
For further information on Indexing tables in SQL Data Warehouse please read https://docs.microsoft.com/en-
us/azure/sql-data-warehouse/sql-data-warehouse-tables-index
Partitioning
Fact tables are typically millions or billions of rows in a data warehouse and using table partitioning during
loading and querying can improve performance.
Benefits to loads
The primary benefit of partitioning in SQL Data Warehouse is to improve the efficiency and performance of
loading data by use of partition deletion, switching and merging (see https://docs.microsoft.com/en-us/azure/sql-
data-warehouse/sql-data-warehouse-tables-partition). In most cases data is partitioned on a date column that is
closely tied to the order in which the data is loaded into the database. One of the greatest benefits of using
partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting
data can be the most straightforward approach, with a little thought and effort, using partitioning during your
load process can substantially improve performance.
Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table
might contain just data for the past 36 months. At the end of every month, the oldest month of sales data is
removed from the table. This data could be removed by using a delete statement to delete the data for the oldest
month. However, removing a large amount of data row-by-row with a delete statement can take too much time,
as well as create the risk of large transactions that take a long time to rollback if something goes wrong. A more
optimal approach is to drop the oldest partition of data. Where deleting the individual rows could take hours,
dropping an entire partition could take seconds.
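A minimal sketch of this monthly roll-off, assuming the fact table is partitioned on the sale date (names, types and boundary values are illustrative):

-- Fact table partitioned by month (only three boundaries shown for brevity)
CREATE TABLE dbo.FactSalesPart
(
    SaleId BIGINT NOT NULL, CustomerId INT NOT NULL, SaleDate DATE NOT NULL, Amount DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01'))
);

-- Empty table with an identical structure to receive the oldest partition
CREATE TABLE dbo.FactSalesPart_old
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2018-01-01', '2018-02-01', '2018-03-01'))
)
AS SELECT * FROM dbo.FactSalesPart WHERE 1 = 2;

-- Switch the oldest month out as a metadata-only operation, then drop it
ALTER TABLE dbo.FactSalesPart SWITCH PARTITION 1 TO dbo.FactSalesPart_old PARTITION 1;
DROP TABLE dbo.FactSalesPart_old;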
Benefits to queries
Partitioning can also be used to improve query performance. A query that applies a filter to partitioned data can
limit the scan to only the qualifying partitions. Partition elimination can avoid a full table scan and only scan a
smaller subset of data. With the introduction of clustered columnstore indexes, the predicate elimination
performance benefits are less beneficial, but in some cases, there can be a benefit to queries. For example, if the
sales fact table is partitioned into 36 months using the sales date field, then queries that filter on the sale date can
skip searching in partitions that don’t match the filter.
Indexing
Clustered columnstore should be your default option for most table types. Clustered columnstore is a highly
efficient compressed storage format that reduces IO greatly during query time.
Clustered indexes enforce a physical row sorting of your data on disk. Consider this type of index only on
relatively small tables (< 60 million rows) such as (replicated) dimension tables that benefit from range filtering
such as on date columns.
Sizing Partitioning
While partitioning can be used to improve performance in some scenarios, creating a table with too many
partitions can hurt performance in others. These concerns are especially true for clustered columnstore tables.
For partitioning to be helpful, it is important to understand when to use partitioning and the number of partitions
to create. There is no hard-and-fast rule as to how many partitions are too many; it depends on your data and how
many partitions you are loading simultaneously. A successful partitioning scheme usually has tens to hundreds
of partitions, not thousands.
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to
each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1
million rows per distribution and partition is needed. Before partitions are created, SQL Data Warehouse already
divides each table into 60 distributions. Any partitioning added to a table is in addition to the distributions
created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given
that SQL Data Warehouse has 60 distributions, then the sales fact table should contain 60 million rows per
month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended
minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows
per partition. The Indexing article on Microsoft docs includes queries that can assess the quality of cluster
columnstore indexes.
Data Validation
Once the performance tuning of the schema has been completed, it is important to check that all the data has been transferred across.
You should implement a robust testing plan which includes comparing row counts, comparing data types and checking that queries run on the source and on Azure SQL Data Warehouse return the same results.
A full testing plan is outside the scope of this whitepaper but consider defining a repeatable process.
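A minimal sketch of the Azure SQL Data Warehouse side of such a check (the schema and table names are illustrative; the Teradata side would run equivalent queries and the results would be compared outside the databases):

-- Row count for one migrated table; repeat per table, or drive the list from the metadata table
SELECT COUNT_BIG(*) AS azure_row_count
FROM EDW.FactSales;

-- Spot-check an aggregate that should match the source system exactly
SELECT SaleDate, SUM(Amount) AS total_amount
FROM EDW.FactSales
GROUP BY SaleDate;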
Microsoft has published a whitepaper called Performance Tuning of Tabular Models in SQL Server 2012
Analysis Services. This will help you optimise and performance tune your model.
Optimise the processing of Azure Analysis Services by using partitions in your model which align with the partitions in Azure SQL Data Warehouse. Large datasets normally require table partitioning to accelerate and
optimize the data-load process. Azure Analysis Services partitioning enables incremental loads, increases
parallelization, and reduces memory consumption. See Automated Partition Management with Azure Analysis
Services and Azure Analysis Service Partitions.
Reporting
After creating your Azure Analysis Services model you should recreate your reports. For a Proof of Concept,
choose no more than five reports and decide if you will keep your existing reporting technology or adopt a new
solution.
If the reports are complex and are going to take time to transform, then Datometry is a great solution for being
able to lift and shift your reporting solution whilst you evaluate your path.
If you are moving to a Microsoft reporting solution such as Power BI it can still make sense to repoint your
current reporting solution, via Datometry, whilst you redevelop your reports. This is a much quicker way to get
users working with your modern data warehouse.
9. Using the Search function, in the top right corner, search the market place for Teradata Database and click
the selection as it appears in the drop-down list.
10. Before continuing check the price plans. For the purpose of this document we will be creating a
Development version of Teradata Database which, as at 4th July 2018, is provided free (All you pay for is
the Azure compute and storage). Teradata is licensed for use on specific Azure VMs. For latest information
on supported software and VMs, visit www.teradata.com/azure/products/.
11. Click Get Now and fill in the details on the form, click continue, and this will then take you through to
provisioning Teradata for Azure.
12. Click Create to take you through to the next Azure blade.
13. Fill out the user name, password and make sure that you choose to create a new resource group. Failure to use
a new resource group will see the deployment fail.
14. Click OK
15. On the Teradata database blade, fill out the DBC password (this is the password you will use to connect to
Teradata).
a. Choose your Purchase method. For our lab we selected Hourly.
b. Choose your Database version and Tier. As most migrated systems are from 15.10 we chose this for our lab, and the development tier, as we are only creating a schema.
c. Leave the VM size at Standard D15 v2
d. Change the number of nodes from 2 to 1.
16. Click OK
25. On the General Settings tab click onto Subnets and choose the three subnets you wish to use. Make sure you allocate the subnet you wish to communicate on to the VM subnet, as you won't be able to communicate over the other two networks.
26. Click OK.
27. Click OK
28. On the Summary page you will see the information which will be passed to the ARM template.
29. Review the summary information
30. Click OK.
Important: once the Teradata database VM has been provisioned, do not shut this VM down
until you have finished the schema conversion, or it will fail to come back online.
1. In the Azure Portal go to Virtual Machines and click on the Teradata Database VM you created.
33. In the VM Blade click on Networking to display the current port rules.
34. You will only add a port rule to the first network card, which usually ends with “-NIC00”.
Create Jumpbox VM
We will use this to connect to the Teradata Azure VM and load some additional software onto this machine to
assist with the schema migration.
38. Using the Search function, in the top right corner, search the market place for Data Science Virtual Machine
- Windows 2012 and click the selection as it appears in the drop-down list.
39. Before continuing check the price plans. For the purposes of this document we will be creating a DS3_v2.
44. Under choose virtual machine size type DS3_v2 into the search box.
45. Click on the VM size
47. Under Settings, check that the disk type is HDD and that managed disks are being used.
48. Click on Virtual Network and select the network which you previously used for the Teradata Database VM
to connect to.
49. Click on Subnet and select the subnet which you allocated to the Teradata Database VM connection.
50. To allow public access, leave the Public IP address setting as it is. Otherwise, select Public IP and then select None.
51. Accept the other defaults and Click OK
Configure Jumpbox VM
The jump box needs to be configured with software to connect to the Teradata database, to create your schema,
and has the migration software installed which will take your Teradata schema across to SQL Data Warehouse.
1. Remote Desktop to the Azure Data Science VM you created in the previous step.
54. Download the Teradata Tools and Utilities v15.xx from
http://downloads.teradata.com/download/tools/teradata-tools-and-utilities-windows-installation-package.
55. Extract the ZIP file to C:\InstallFiles\TeradataTools
56. Go to C:\InstallFiles\TeradataTools\TTUExpress
57. Launch TTU_BASE to install core Teradata Components
60. Download the Attunity Replicate for Microsoft Migrations User and Setup Guide from
https://attunity.jiveon.com/servlet/JiveServlet/downloadBody/1013-102-5-
1048/AttunityReplicateMSM_5_5_0_User_Guide.pdf
61. Open PDF and follow installation instructions from section three to four.
62. Choose “Install local Replicate for Microsoft Migrations Server”.
63. Download Java SE Runtime Environment 8 and install from
http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html
64. Right-click on My Computer, and then select Properties.
65. Click on Advanced system settings.
66. Click on Environment Variables.
67. Add C:\Program Files\Teradata\Client\15.10\ODBC Driver for Teradata to the System Path.
70. Download, Open and Install the Teradata Developer tools from Visual Studio from
https://marketplace.visualstudio.com/items?itemName=MikeDempsey.TeradataDeveloperTools
78. Click Server and create a new server (make a note of the login, password and server name).
80. Click Performance level and configure for Gen1 DWU100 – you can scale this up later when you need to
load data.
81. Click Apply.
1. Remote Desktop to the Azure Data Science VM you created in the previous step.
83. Open SQL Server Management Studio.
86. In Object Explorer, expand Databases. Then expand System databases and master to view the objects in the
master database. Expand the Azure SQL Data Warehouse Database you created to view the objects in your
new database.
87. To create the user for loading data, in SSMS right click master to show a drop-down menu and choose New
Query.
88. In the query window, enter these T-SQL commands to create a login and user named LoaderRC60,
substituting your own password for 'a123STRONGpassword!'.
a. CREATE LOGIN LoaderRC60 WITH PASSWORD = 'a123STRONGpassword!';
b. CREATE USER LoaderRC60 FOR LOGIN LoaderRC60;
89. Click Execute.
90. Right-click your Azure SQL Data Warehouse Database and choose New Query. A new query Window
opens.
91. Enter the following T-SQL commands to create a database user named LoaderRC60 for the LoaderRC60
login. The second line grants the new user CONTROL permissions on the new data warehouse. These
permissions are similar to making the user the owner of the database. The third line adds the new user as a
member of the 'largerc' resource class.
a. CREATE USER LoaderRC60 FOR LOGIN LoaderRC60;
b. GRANT CONTROL ON DATABASE::[<YourSQLDWDB>] to LoaderRC60;
c. EXEC sp_addrolemember 'largerc', 'LoaderRC60';
92. Click Execute
93. To create the database master key, in SSMS right click your Azure SQL Data Warehouse database to show a drop-down menu and choose New Query. In the query window, enter this T-SQL command:
94. CREATE MASTER KEY
100. In the new query window, type the statements to recreate your tablespace(s) on Teradata
101. Before executing you need to connect to the Azure Teradata instance. From menu select Teradata and
go to Query Tool > Connect
102. The Connection Information dialogue allows you to connect to the instance.
a. Data Source Name: TD
b. Server: IP of Teradata Network -NIC00 interface
c. User Name: dbc
d. Password: <Password you set>
103. Click OK
104. From menu select Teradata and go to Query Tool >Execute
105. Repeat the above to create the tables in the Teradata Database
Open the Replicate for Microsoft Migrations Console in a browser at http://<computer name>/attunityreplicatemsm
107. Add a Teradata Migration Task to the Attunity Replicate for Microsoft Migrations Console.
After opening the console, the next step is to add a migration task.
1. Make sure that you are in TASKS view. The word "TASKS" should be displayed in the upper left corner of
the Attunity Replicate for Microsoft Migrations Console.
108. Click New Task to open the New Task dialog box.
109. In the Name field, enter TeradatatoAzureSQL and then click OK.
110. Click the Apply Changes button to disable this task option. When this option is enabled, Replicate for
Microsoft Migrations detects any changes to the source tables and automatically applies them to the target
tables.
111. Click OK to add the task (the task is displayed in a new tab)
1. In the Attunity Replicate for Microsoft Migrations console, click Manage Endpoint Connections to open the
Manage Endpoint Connections dialog box.
2. Click New Endpoint Connection.
3. In the Name field, type a name for your Teradata database. This can be any name that will help to identify
the database being used.
4. In the Description field, type a description that helps to identify the Teradata database. This is optional.
5. Select SOURCE as the database role.
1. In the Attunity Replicate for Microsoft Migrations Console, click Manage Endpoint Connections to open the
Manage Endpoints Connections dialog box.
2. In the Manage Endpoint Connections dialog box, click New Endpoint Connection.
3. In the Name field, type a name for your SQL Data Warehouse service. This can be any name that will help to identify your SQL Data Warehouse database.
4. In the Description field, type a description that helps to identify the SQL Data Warehouse target database.
This is optional.
5. Select TARGET as the role.
6. Select SQL Data Warehouse as the Type.
7. Enter the following SQL Data Warehouse information:
a. Server name: Specify the name of the SQL Data Warehouse server you created earlier
b. Port: Specify port 1433
c. User name: LoaderRC60
d. Password: Specify the password for the LoaderRC60 user created earlier
e. Database name: Specify the target database name.
f. Azure Blob Storage Access: During a migration task, SQL Data Warehouse authenticates itself to Azure Blob Storage using an SQL Server Credential. You can either configure Replicate for Microsoft Migrations to create the Credential automatically during runtime (the default) or use an existing Credential.
g. Select Automatically create SQL Server Credential
h. Enter the following Microsoft Azure Blob Storage information. You may need to click the Microsoft Azure Blob Storage header to see the information.
i. Account name: Specify the name of the Azure Blob Storage account that you created earlier.
j. Container name: Specify the name of the Azure Blob Storage container which you created earlier.
k. Account key: Specify the key for your Azure Blob Storage account.
l. Folder: Specify the name of the Azure Blob Storage folder to which you want the files copied.
116. In the Advanced tab, set the following properties: Max file size (MB): Change the file size to 2000MB.
The Endpoints pane is in the left of the console. Replicate for Microsoft Migrations supports several endpoint
types (database, Hadoop, data warehouse, file, NoSQL) with each endpoint type having its own icon. As both of
our endpoints are databases, the following icons will be displayed:
1. In the right pane of the Attunity Replicate for Microsoft Migrations Console, click the Table Selection button. The Select Tables dialog box opens.
2. In the Select Tables dialog box, do the following:
3. From the Schema list, select the schema containing your source tables, and then click Search.
4. From the Table List, either:
− Hold down the [Ctrl] button and select the tables you want to migrate. Then, click the > button
to move the tables to the Selected Tables list.
− Click the >>> button to move all the tables to the Selected Tables list.
The Starting task dialog is displayed, and the console automatically switches to Monitor view:
− Full Load tab: Uses status bars to provide a visual indication of the loading status during the full
load operation.
− Change Processing tab: Shows any changes that occur after the full load completes. Since this is a Full Load only task, this tab is not relevant.
122. Click the Completed status bar. Replicate for Microsoft Migrations displays a table below the graphs
with statistics related to the migrated tables.
123. Click the Total Completion bar to see a summary of the task.
1. Remote Desktop to the Azure Data Science VM you created in the previous step.
124. Open SQL Server Management Studio
125. In the Connect to Server dialog box enter:
a. Server type: Database Engine
b. Server name: The fully qualified server name
c. Authentication: SQL Server Authentication
d. Login: The server admin account (this is the account that you specified when you created the server)
e. Password: The password for your server admin account
127. Click Connect. The Object Explorer Windows opens in SSMS.
128. In Object Explorer, expand Databases. Expand the Azure SQL Data Warehouse database you created, then Tables, to view the objects in your new database.
Using PowerShell, with the Azure PowerShell module 3.6 or later, open “.\02-ADF\01-Pre-Requisites.ps1” and
step through the script.
Sign in to Azure
Sign in to your Azure subscription with the Connect-AzureRmAccount command and follow the on-screen
directions.
Connect-AzureRmAccount
If you don't know which location you want to use, you can list the available locations with Get-AzureRmLocation.
After the list is displayed, find the one you want to use. This example uses eastus. Store it in a variable and use the
variable so you can change it in one place, then create the resource group:
$location = "eastus"
$resourceGroup = "myResourceGroup"
New-AzureRmResourceGroup -Name $resourceGroup -Location $location
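The storage account referenced on the next line ($storageAccount) is expected to exist already (for example,
created by the 01-Pre-Requisites.ps1 script). If you are creating it by hand, a minimal sketch with a placeholder
account name:
# Sketch: create a general-purpose storage account to hold the exported files
# (storage account names must be globally unique, lowercase, 3-24 characters)
$storageAccount = New-AzureRmStorageAccount -ResourceGroupName $resourceGroup `
    -Name "tdmigrationstore01" -Location $location -SkuName Standard_LRS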
$ctx = $storageAccount.Context
Create a container
Blobs are always uploaded into a container. You can organize groups of blobs in containers much like the way
you organize files in folders on your computer.
Set the container name, then create the container using New-AzureStorageContainer, setting the permissions to
'blob' to allow public read access to the files. The container name in this example is import:
$containerName = "import"
New-AzureStorageContainer -Name $containerName -Context $ctx -Permission blob
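If you need to stage a file in the container manually (outside the Data Factory pipelines), it can be uploaded with
Set-AzureStorageBlobContent; a minimal sketch, using a hypothetical export file path:
# Sketch: upload a single exported file into the import container
Set-AzureStorageBlobContent -File "X:\ExportFiles\dbo.MyTable.csv" `
    -Container $containerName -Blob "dbo.MyTable.csv" -Context $ctx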
133. In Object Explorer, expand Databases. Right-click the database you created earlier and select New Query.
134. In the query window, enter the following T-SQL commands:
1. Start PowerShell on your machine, and keep it open through completion of this section. If you close and
reopen it, you'll need to run these commands again.
a. Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:
Connect-AzureRmAccount
b. Run the following command to view all the subscriptions for this account:
Get-AzureRmSubscription
c. Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
Select-AzureRmSubscription -SubscriptionId "<SubscriptionId>"
a. The name of the Azure data factory must be globally unique. If you receive an error that the name is not
available, change the name and try again.
b. To create Data Factory instances, you must be a Contributor or Administrator of the Azure subscription.
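The data factory and the self-hosted integration runtime are created by the 02-ADF scripts; conceptually they
wrap cmdlets along the lines of the following sketch (the resource group, factory, and runtime names are taken
from the sample output shown below):
# Sketch: create the data factory, then define a self-hosted integration runtime in it
$dataFactoryName = Set-AzureRmDataFactoryV2 -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "onpremdf0914" -Location $location
Set-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName "ADFTutorialResourceGroup" `
    -DataFactoryName $dataFactoryName.DataFactoryName -Name "myonpremirsp0914" `
    -Type SelfHosted -Description "selfhosted IR description"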
Id : /subscriptions/<subscription
ID>/resourceGroups/ADFTutorialResourceGroup/providers/Microsoft.DataFactory/factori
es/onpremdf0914/integrationruntimes/myonpremirsp0914
Type : SelfHosted
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : onpremdf0914
Name : myonpremirsp0914
Description : selfhosted IR description
138. To retrieve the status of the created integration runtime, run the .\02-ADF\05bSelfHostedRuntime.ps1
script.
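The script wraps a status call similar to the following sketch (variable names follow the convention used by the
other commands in this guide):
# Sketch: retrieve the integration runtime definition together with its live status
Get-AzureRmDataFactoryV2IntegrationRuntime -Name $integrationRuntimeName `
    -DataFactoryName $dataFactoryName.DataFactoryName `
    -ResourceGroupName $resourceGroupName.ResourceGroupName -Status
Until the node has been registered, the returned status resembles the following, with State reported as
NeedRegistration: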
Nodes : {}
CreateTime : 9/14/2017 10:01:21 AM
InternalChannelEncryption :
Version :
Capabilities : {}
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
AutoUpdate :
ServiceUrls : {eu.frontend.clouddatahub.net,
*.servicebus.windows.net}
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Name : <Integration Runtime name>
State : NeedRegistration
139. To retrieve the authentication keys for registering the self-hosted integration runtime with the Data
Factory service in the cloud, run the following command. Copy one of the keys (excluding the quotation
marks) for registering the self-hosted integration runtime that you install on your machine in the next step.
Get-AzureRmDataFactoryV2IntegrationRuntimeKey `
-Name $integrationRuntimeName `
-DataFactoryName $dataFactoryName.DataFactoryName `
-ResourceGroupName $resourceGroupName.ResourceGroupName | `
ConvertTo-Json
{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
When the self-hosted integration runtime is registered successfully, a confirmation message is displayed.
140. In the New Integration Runtime (Self-hosted) Node window, select Next.
141. In the Register Integration Runtime (Self-hosted) window, select Launch Configuration Manager.
142. When the node is connected to the cloud service, a confirmation message is displayed.
In this step, you link your on-premises Teradata instance to the data factory.
1. Open the JSON file named TeradataLinkedService.json in the C:\ADF\Connectors folder and edit it as
follows:
a. Replace <integration runtime name> with the name of your integration runtime.
b. Before you save the file, replace <servername>, <username>, and <password> with the values of your
Teradata instance.
c. If you need to use a backslash (\) in your user account or server name, precede it with the escape
character (\). For example, use mydomain\\myuser.
In this step, you link the local storage (the file share or folder where the exported files will be written) to the
data factory.
1. Open the JSON file named FileLinkedService.json in the C:\ADF\Connectors folder and edit it as follows:
a. Replace <server> and <ShareName> with the values of your file path, either a file share or a local path
such as x:\ExportFiles.
b. Before you save the file, replace <Domain>, <UserName>, and <password> with the values of a user
that has permission to write files to the local storage.
c. If you need to use a backslash (\) in your user account or server name, precede it with the escape
character (\). For example, use mydomain\\myuser.
In this step, you link the Azure Blob Storage account and the Azure SQL Data Warehouse to the data factory.
1. Open the JSON file named AzureStorageLinkedService.json in the C:\ADF\Connectors folder and edit it as
follows:
a. Replace <accountName> and <accountKey> with the name and key of your Azure Storage account
before saving the file.
2. In the linked service file for Azure SQL Data Warehouse (in the same folder), replace <servername>,
<databasename>, <username>, and <password> with the values of your Azure SQL Data Warehouse
database before saving the file. Use the user with the largerc resource class that you created earlier.
1. Run the .\02-ADF\06-CreateLinkedServers.ps1 script to create all of the linked services (a sketch of the
cmdlet the script wraps follows this list):
a. Select the Azure Resource Group you created or used in the previous step.
b. Select the Azure Data Factory you created in the previous step.
c. Select the Self Hosted Integration Runtime created in the previous step.
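Behind the scenes, the script publishes each edited JSON definition to the data factory. A minimal sketch of the
underlying cmdlet, shown here for the Teradata linked service:
# Sketch: publish one linked service definition from its JSON file
Set-AzureRmDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName.ResourceGroupName `
    -DataFactoryName $dataFactoryName.DataFactoryName -Name "TeradataLinkedService" `
    -DefinitionFile "C:\ADF\Connectors\TeradataLinkedService.json"
Datasets and pipelines are published the same way with Set-AzureRmDataFactoryV2Dataset and
Set-AzureRmDataFactoryV2Pipeline respectively.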
Create Datasets
1. Run the .\02-ADF\07-CreateDatasets.ps1 script to create all of the datasets:
a. Select the Azure Resource Group you created or used in the previous step.
b. Select the Azure Data Factory you created in the previous step.
This will create all the datasets required by the four pipelines.
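To confirm that the datasets were created, they can be listed from PowerShell; a minimal sketch:
# Sketch: list the datasets now defined in the data factory
Get-AzureRmDataFactoryV2Dataset -ResourceGroupName $resourceGroupName.ResourceGroupName `
    -DataFactoryName $dataFactoryName.DataFactoryName | Select-Object Name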
Dataset Notes
In SourceDW.json, in the C:\ADFv2\02-ADF\DataSets folder, "tableName" is a dummy value; the pipeline uses
the SQL query in the copy activity to retrieve the data. This can be checked or changed in the SourceDW dataset
in the Azure Data Factory user interface.
In TDDataset.json, in the C:\ADFv2\02-ADF\DataSets folder, "tableName" is likewise a dummy value; the
pipeline uses the SQL query in the copy activity to retrieve the data. This can be checked or changed in the
TDDataset dataset in the Azure Data Factory user interface.
In FileORC.json, in the C:\ADFv2\02-ADF\DataSets folder, "FileDirectory" and "FileTableName" are defined
as parameters. The copy activity that references this dataset passes the actual directory and file name into the
dataset; they are referenced dynamically in the file path, and the file format is set to ORC. This can be checked
or changed in the FileORC dataset in the Azure Data Factory user interface.
Create Pipelines
1. Run the .\02-ADF\08-CreatePipelines.ps1 script to create all of the pipelines:
a. Select the Azure Resource Group you created or used in the previous step.
b. Select the Azure Data Factory you created in the previous step.
This will create all four pipelines used for the migration (GetTableListAndTriggerCopyTDToCSV,
IterateAndCopyTDTablesToCSV, GetTableListAndTriggerCopyBlobtoSQLDW, and
IterateAndCopyBlobToDW).
Pipeline Notes
Teradata to Local Storage
The GetTableListAndTriggerCopyTDToCSV pipeline looks up the Azure SQL Data Warehouse system table to
get the list of tables to be copied from Teradata to the local storage. This can be checked or changed in the
GetTableListAndTriggerCopyTDToCSV pipeline in the Azure Data Factory user interface by selecting the
LookupTableList activity.
Selecting the TDTableToCSVFilePipeline activity and its settings will show how the output of LookupTableList
is passed to the IterateAndCopyTDTablesToCSV pipeline.
The output of the lookup activity is passed into an array parameter which is used by the ForEach loop.
By selecting the ForEach activity, you will see that this array is passed to another parameter associated with the
activity.
Editing the activities for IterateAndCopyTDTablesToCSV and selecting CopyTDtoCSV, you can investigate
how Teradata data is copied to local storage using a dynamic query.
The local storage is passed a directory and file name based on the schema and table name. This example is a basic
extract process. You know your data, and we highly recommend that you modify this query to partition the data
export based on the advanced metadata approach described above.
Blob Storage to Azure SQL Data Warehouse
In the GetTableListAndTriggerCopyBlobtoSQLDW pipeline, selecting the BlobtoSQLDW activity and its
settings will show how the output of LookupTableList is passed to the IterateAndCopyBlobToDW pipeline.
The output of the lookup activity is passed into an array parameter which is used by the ForEach loop.
By selecting the ForEach activity, you will see that this array is passed to another parameter associated with the
activity.
Editing the activities for IterateAndCopyBlobToDW and selecting CopyCSVtoSQLDW, you can investigate
how the blob storage is passed a directory and file name based on the schema and table name.
Azure SQL Data Warehouse is set up as the sink; it makes use of PolyBase, parameterizes the destination, and
truncates the table before loading. In an advanced metadata approach, you would remove the pre-copy script and
make use of Azure SQL Data Warehouse snapshot backups, which are retained for seven days.
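Once everything is deployed, a migration run can be started and monitored from PowerShell as well as from the
Azure Data Factory user interface. The following is a minimal sketch, assuming the pipeline takes no additional
parameters (pass any it defines with -Parameter):
# Sketch: trigger the Teradata extract pipeline and poll its run until it reaches a terminal state
$runId = Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName.ResourceGroupName `
    -DataFactoryName $dataFactoryName.DataFactoryName -PipelineName "GetTableListAndTriggerCopyTDToCSV"
while ($true) {
    $run = Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName.ResourceGroupName `
        -DataFactoryName $dataFactoryName.DataFactoryName -PipelineRunId $runId
    if ($run.Status -in @("Succeeded", "Failed", "Cancelled")) { break }
    Start-Sleep -Seconds 30
}
$run.Status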
References
Why Move to SQL Data Warehouse?
https://datometry.com/solutions/replatforming/migrate-teradata-to-azure
Migrate your solution to Azure SQL Data Warehouse
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-migrate
Migrating data to Azure SQL Data Warehouse in practice
https://blogs.msdn.microsoft.com/sqlcat/2016/08/18/migrating-data-to-azure-sql-data-warehouse-in-practice/
Note: For additional information about migrating various source databases to Azure, see the Azure
Database Migration Guide.