Basic Design Concepts

SSIS package storage and execution:
- In SSIS 2008, a package can be stored on one system and executed from a different system.
- A package can be stored either in the file system or in MSDB.
- A package can be executed either from the storage server or from another server.
- A package can be executed through the following:
  o BIDS (Business Intelligence Development Studio)
  o DTExecUI
  o DTExec (CMD/BAT)
  o SQL Agent (a T-SQL sketch for starting an Agent job appears after these lists)
  o Programmatically
- SSIS packages can be executed from any system that has the SSIS DLLs or executables installed.
- The data flow impact on the package execution server involves the resources needed to manage the data buffers, the data conversion requirements as data is imported from sources, the memory involved in the lookup cache, the temporary memory and processor utilization required for the Sort and Aggregate transformations, and so on. Essentially, any transformation logic contained in the data flows is handled on the server where the package is executed.

Executing a package on the source server:
- The source server both provides the extracted data and handles the data flow transformation logic, while the destination server absorbs the data load overhead (such as disk I/O for files, database inserts, or index reorganization).
- Benefits of this approach:
  o Decreased impact on the destination server, where potential users are querying.
  o Data flow buffers are loaded rapidly, because the source files and package execution are local and involve no network I/O.
  o The impact on the destination server is limited, which is useful for destination servers that are in 24/7 use or when the SSIS process runs often.
- Drawbacks of this approach:
  o Impact on the source server's resources, which may affect applications and users on the source server.
  o Potentially reduced performance of the data flow destination adapter, and the inability to use the SQL Destination adapter, which requires the package to be executed on the same server as the destination database.

Executing a package on the destination server:
- Benefits of this approach:
  o Limited impact on the source server if it is running other critical tasks.
  o Potential performance benefits for data inserts, especially because the SQL Destination component can now be used.
  o Licensing consolidation if the destination server is also running SQL Server 2008.
- One drawback of this approach is the heavy impact on the destination server, which may affect users interacting with it.
- This approach is very useful if you have users querying and using the destination during the day, and your SSIS processing requirements can be handled through nightly processes.
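As a small illustration of the SQL Agent option, the following T-SQL sketch starts an existing Agent job that has been configured to run the package; the job name is hypothetical.

-- Start an existing SQL Agent job that runs the SSIS package.
-- sp_start_job returns immediately; the job itself runs asynchronously.
EXEC msdb.dbo.sp_start_job @job_name = N'Nightly DW Load';

Scheduling the same job lets the package run unattended, which is the typical use of SQL Agent execution.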

Using a standalone SSIS server makes more sense if your sources and destinations are not on the same physical machines. In this case, the impact on both the source and destination machines is reduced, because the SSIS server handles the data flow transformation logic. This architecture also provides a viable SSIS application server approach, where one machine can handle all the SSIS processing packages no matter where the data is coming from and going to. The drawbacks to this approach lie in the reduced ability to optimize the source extraction and destination import, increased network I/O (because the data has to travel over the wire twice), and licensing costs.

Design Factors for SSIS Packages


Before designing any SSIS packages, you must decide on a number of factors that will affect the design of your packages. These factors include auditing, configurations, and monitoring. Although SSIS offers a variety of out-of-the-box methods to help address these concerns, a few limitations exist to using these standard methods. In general, one issue is that you must set up all the options every time you create a package. Not only does this slow down development, it also opens up the possibility of a fellow developer mistyping something, or choosing a different method than you have used. These potential problems can make maintaining and supporting the package a nightmare for someone who did not initially create it.

Reusability

Logging and Auditing

The package objects that you want to audit are variables and tasks. You record the start and end times of the tasks along with task identification information. Whenever a variable's value changes, you record that change with the new value. All log entries and audit information should be linked together to promote easy querying and investigation into the data.
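As a rough illustration only (this is not a standard SSIS schema), the audit tables might look something like the following T-SQL sketch; the table and column names are assumptions for the example.

-- One row per task execution: start/end times plus task identification.
CREATE TABLE dbo.TaskAuditLog
(
    TaskLogID     int IDENTITY(1,1) PRIMARY KEY,
    PackageLogID  int              NOT NULL,  -- links task rows to a package-level log row
    TaskName      nvarchar(128)    NOT NULL,
    TaskGuid      uniqueidentifier NOT NULL,
    StartDateTime datetime         NOT NULL,
    EndDateTime   datetime         NULL
);

-- One row per variable value change (for example, written from an
-- OnVariableValueChanged event handler), recording the new value.
CREATE TABLE dbo.VariableAuditLog
(
    VariableLogID  int IDENTITY(1,1) PRIMARY KEY,
    PackageLogID   int             NOT NULL,  -- same key, so log and audit entries query together
    VariableName   nvarchar(128)   NOT NULL,
    NewValue       nvarchar(4000)  NULL,
    ChangeDateTime datetime        NOT NULL
);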

Additional design elements to standardize across your packages include the following:
- Configuration management
- Logging and auditing mechanism
- Template package

Transformations
- SSIS provides three transformations (Merge, Merge Join, and Union All) that let you combine data from various sources to load into a data warehouse by running the package only once, rather than running it multiple times, once for each source.
- The Aggregate transformation can perform multiple aggregates on multiple columns.
- The Sort transformation sorts data on a sort-order key that can be specified on one or more columns.
- The Pivot transformation can transform relational data into a less-normalized form, which is sometimes how data is saved in a data warehouse.
- The Audit transformation lets you add columns with lineage and other environmental information for auditing purposes.
- A new addition in SSIS 2008 is the Data Profiling task, which allows you to identify data quality issues by profiling data stored in SQL Server so that you can take corrective action at the appropriate stage.
- Using the Dimension Processing destination and the Partition Processing destination as part of your data-loading package helps automate the loading and processing of an OLAP database.

Integration Services includes the following transformations that enable you to perform various operations to standardize data:
- The Character Map transformation allows you to perform string functions on string data type columns, such as changing the case of the data.
- The Data Conversion transformation allows you to convert data to a different data type.
- The Lookup transformation enables you to look up an existing data set to match and standardize the incoming data.
- The Derived Column transformation allows you to create new column values or replace the values of existing columns based on expressions. SSIS allows extensive use of expressions and variables and hence enables you to derive required values in quite complex situations.

Data Extraction Best Practices

Data extraction is the process of moving data off a source system, potentially to a staging environment, or into the transformation phase of the ETL. Following are a few common objectives of data extraction:
- Consistency in how data is extracted across source systems
- Performance of the extraction
- Minimal impact on the source to avoid contention with critical source processes
- Flexibility to handle source system changes
- The capability to target only new or changed records

Extraction Data Criteria

Deciding what data to extract from the source system is not always easy. Whether the destination is a data warehouse or an integration with some other system, extracting all the data available from the source will rarely be necessary. You want to extract the minimal data elements needed to satisfy your requirements, but you should keep in mind some other considerations:
- Context: The data you extract will likely be kept in your staging database for validation and user drill-through reporting purposes. There is likely some extra information in the source system that isn't strictly required but could improve the meaningfulness of the extracted data.
- Future use: If destination requirements change in the future, you'll look very smart if you have already been extracting the needed data and have it in your staging database. Predicting what extra data might be needed someday can be tricky, but if you can make reasonable guesses and have the space to spare in your staging database, it could pay off big later.

Source System Impact

A good ETL process will be kind to its source systems. It is not uncommon for these sources to be heavily loaded transactional systems, and maybe a legacy mainframe system to boot. The added impact of a selfish ETL process could potentially bring a delicate source to its knees, along with all of its users. Here are some general principles for creating an extraction process with minimal impact on the source system:
- Bring the data over from the source with minimal transformations: Don't use expressions or functions that execute on the source system. The less the source system has to think about your query, the better. The exception is if the source data uses a data type that your extraction process doesn't understand; in this case, you may need to cast the data to another type.
- Only extract the columns that you'll use: Don't ask the source system for 50 columns and then only use 10. Asking for some extra data, as discussed previously, is okay because you'll be doing something with it, but retrieving columns that have no purpose whatsoever is not good.
- Keep selection criteria simple: You may need to find a balance between the cost of transferring extra rows and the cost of filtering them out in the extraction query. Functions or expressions that must be evaluated for every single row in the source table are best avoided. There may be a significant difference between OrderDate + 5 > Today() and OrderDate > Today() - 5 (the sketch after this list shows the equivalent T-SQL).
- Don't use joins: This may fall under selection criteria, but it warrants its own bullet. Unless the join is beneficial to reducing the overall extraction cost, it should probably be avoided. Instead of joining from Orders to Products to get the product name, it is better to extract Orders and Products separately.

Some of these principles are subjective in nature and may have different implications for different source systems. Performance tuning is sometimes as much art as science, and a little trial and error may be in order. Your mileage may vary.
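As an illustration of the selection criteria point, here is a hedged T-SQL sketch of the two filter styles, written against a hypothetical dbo.Orders table with an index on OrderDate and using GETDATE() in place of the generic Today(). The first predicate wraps the column in an expression, so it must be evaluated for every row; the second compares the bare column to a precomputed value, so an index seek is possible.

-- Non-sargable: the expression on OrderDate is evaluated for every row,
-- so an index on OrderDate cannot be used to seek.
SELECT OrderID, OrderDate, CustomerID
FROM dbo.Orders
WHERE DATEADD(DAY, 5, OrderDate) > GETDATE();

-- Sargable: the column is compared directly to a precomputed value,
-- so the optimizer can seek on an OrderDate index.
SELECT OrderID, OrderDate, CustomerID
FROM dbo.Orders
WHERE OrderDate > DATEADD(DAY, -5, GETDATE());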

Incremental Extraction
One of the best ways to be nice to the source system (and it's good for your process, too) is to extract only changed data. This way, you aren't wasting I/O bandwidth and processor cycles on data that hasn't changed since the last time you extracted it. This method is called incremental extraction, because you are getting the data from the source in cumulative pieces.

Here are a few of the many ways that you may be able to identify changed source records:
- Use a modified date or created date column from a database source: These are called change identifier columns. In fact, many transactional systems already have change identifier columns that can be used for an incremental extraction. This is probably the most common approach to incremental extraction (a minimal sketch follows this list). Note that a creation date is only valuable if the rows are never updated.
- Use an auto-incrementing change identifier: If the source system doesn't have a modified date or created date column, there may be an auto-incrementing column acting as a change identifier, which increases every time a row changes. To be useful, the incrementing value must be scoped to the table, not the row (that is, no duplicates).
- Use an audit table: Some source systems already have, or allow, a trigger (or similar mechanism) to capture changes to an audit table. An audit table may track the keys of the changed rows, or the details of the change. However, this approach involves overhead, because triggers are expensive.
- Log-based auditing: Some database servers provide a log-reader-based mechanism to automatically track changes in a similar way to the trigger-based mechanism, but with much lower overhead. These log-based systems are often called Change Data Capture (CDC) features, and to take advantage of them, you need a source system that supports them.
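The following is a minimal T-SQL sketch of the change identifier approach, assuming a hypothetical source table dbo.Orders with a ModifiedDate column and a small control table dbo.ExtractLog that records the high-water mark of the previous run; all object names are illustrative.

-- Read the high-water mark recorded by the previous extraction run.
DECLARE @LastExtract datetime, @CurrentExtract datetime;

SELECT @LastExtract = LastModifiedDate
FROM dbo.ExtractLog
WHERE SourceTable = 'dbo.Orders';

SET @CurrentExtract = GETDATE();

-- Pull only the rows that changed since the last run.
SELECT OrderID, CustomerID, OrderDate, OrderStatus, ModifiedDate
FROM dbo.Orders
WHERE ModifiedDate > @LastExtract
  AND ModifiedDate <= @CurrentExtract;

-- After a successful load, advance the high-water mark.
UPDATE dbo.ExtractLog
SET LastModifiedDate = @CurrentExtract
WHERE SourceTable = 'dbo.Orders';

Recording the upper boundary (@CurrentExtract) before the query runs avoids missing rows that are modified while the extraction is in progress.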

Deleted Rows
If the source system allows rows to be deleted, and if that information is important to the destination (sometimes it isn't), then the fact that a row was deleted must be represented somehow for incremental extraction to work. Doing so can be difficult using the change identifier extraction method because, if the row is deleted, the change identifier is deleted along with it. Following are a couple of scenarios where deleted rows work with incremental extraction:
- When a delete is really just an update to a flag that hides the record from an application (the sketch after this list illustrates this)
- When an audit mechanism records that the row was deleted
Sometimes, more creative solutions can be found. But if the source system doesn't provide a facility to identify changed and deleted rows, then all rows must be extracted. The change detection then takes place farther along in the ETL process, through comparison with existing data.
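For the first scenario, where a delete is really an update to a flag, a hedged sketch follows. It assumes the hypothetical dbo.Orders table also carries an IsDeleted flag that the application sets instead of physically deleting rows, so deletes surface through the normal change identifier extraction.

-- Soft deletes raise ModifiedDate like any other update, so they are picked up
-- by the incremental extract; downstream logic treats IsDeleted = 1 as a delete.
DECLARE @LastExtract datetime = '2024-01-01';  -- in practice, read from the extract log

SELECT OrderID, CustomerID, OrderDate, IsDeleted, ModifiedDate
FROM dbo.Orders
WHERE ModifiedDate > @LastExtract;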

Staging Database
After you have extracted the source data, you need a place to put it. Depending on your requirements, you might be able to go directly to the target database, but a staging database is highly recommended. Traditionally, a staging database is one where you perform data cleansing. With today's tools, which can perform many data cleansing tasks while the data is in flight, staging databases are less of a requirement, but they can still be beneficial. The benefits of a staging database may include the following:
- Data lineage: The primary value of a staging database is as a place where data lineage can be tracked, which is very helpful for data validation (a sketch of a staging table with lineage columns appears below).
- Restartability: By saving the extracted data to a staging database as an intermediate step before performing any loading, the data is safe in case of a load failure and won't need to be re-extracted.
- Source alternative: A staging database can be handy as an alternative to the source system if that system has limitations.
- Archive: If you choose to do so, the staging database can be designed to function as an archive that tracks history on the source system. This could allow a full reload of the data warehouse, including history, if it should ever become corrupted.
- Performance: Some types of cleansing and transformation operations are best performed by the database engine rather than the ETL tool. For example, a database engine can usually sort and perform aggregation functions faster and with less memory.

Of course, some downsides exist to using a staging database that may negate some of the benefits just described. Using an intermediary database necessarily involves disks, which may increase storage requirements and decrease performance. Initial development costs may well be higher, but a well-designed staging database will probably pay for itself in the long run.
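As a hedged sketch only, a staging table for the Orders extract might look like the following, with lineage columns added alongside the source columns; the table and column names are illustrative assumptions, not a prescribed layout.

-- Staging copy of the source Orders extract, created in the staging database.
CREATE TABLE dbo.stg_Orders
(
    -- Columns copied from the source with minimal transformation.
    OrderID         int           NOT NULL,
    CustomerID      int           NOT NULL,
    OrderDate       datetime      NOT NULL,
    OrderStatus     nvarchar(20)  NULL,
    ModifiedDate    datetime      NOT NULL,

    -- Lineage columns added by the ETL process.
    LoadID          int           NOT NULL,  -- identifies the extraction run
    ExtractDateTime datetime      NOT NULL   -- when the row was extracted
);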

Single-Column Profiles
Single-column profiles enable you to analyze a single column independently for null values, column statistics, pattern profiles, length distribution, and value distribution within the column.
- Column Length Distribution Profile: You will perform this computation on a column containing text strings to identify any outliers. For example, if the column you are profiling contains fixed-length codes, any variation in length indicates a problem in the data. This profile type computes all the distinct lengths of string values in the selected column and the percentage of rows in the table that each length represents.
- Column Null Ratio Profile: You will perform this computation to find missing data in a column of any data type. For example, an unexpectedly high ratio of null values in a column indicates the absence of data. This profile computes the percentage of null values in the selected column (an ad-hoc T-SQL approximation appears after this list).
- Column Pattern Profile: This profile request generates a set of regular expressions and the percentage of related string values. You will use this profile to determine invalid strings in the data. It can also suggest regular expressions that can be used in the future to validate new values.
- Column Statistics Profile: This profile request works with numeric and datetime columns and can compute statistics for minimum and maximum values. Additionally, you can generate statistics for average and standard deviation values for numeric columns. This profile can help you identify values that lie outside the range you expect in a column, or that have a higher standard deviation than expected.
- Column Value Distribution Profile: This profile is of most interest when you want to know the distinct values in a column and their percentage of rows. It can help you understand your data a bit more, or, if you already know the number of expected values, it can help you find problems in the data. This profile request works with most data types, such as numeric, string, and datetime formats.
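The Data Profiling task computes these profiles for you, but as a rough illustration of what the Column Null Ratio and Column Value Distribution profiles measure, here is an ad-hoc T-SQL approximation against a hypothetical dbo.Customers table with a Phone column (all names are assumptions).

-- Null ratio: percentage of rows where Phone is NULL.
SELECT 100.0 * SUM(CASE WHEN Phone IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS NullRatioPct
FROM dbo.Customers;

-- Value distribution: distinct values and the percentage of rows each represents.
SELECT Phone,
       COUNT(*)                                  AS RowCnt,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER ()  AS PctOfRows
FROM dbo.Customers
GROUP BY Phone
ORDER BY RowCnt DESC;

The windowed SUM(COUNT(*)) OVER () supplies the total row count without a second scan, so each group's percentage is computed in a single pass.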

Multiple-Column Profiles
Using multiple-column profiles, you can profile a column based on the values in other columns. These include the Candidate Key profile, the Functional Dependency profile, and the Value Inclusion profile.
- Candidate Key Profile: This profile request identifies the uniqueness of a column or set of columns, and hence can help you determine whether the column or set of columns is appropriate to serve as a key for the selected table. You can also use this profile request to find duplicates in a potential key column (a T-SQL approximation appears after this list).
- Functional Dependency Profile: This profile request finds the extent to which the values in one column are dependent on the values in another column or set of columns. Using this profile, you can validate the data in a column based on another column.
- Value Inclusion Profile: This profile request checks whether the values in a column also exist in another column. Using this profile, you can identify the dependency and determine whether a column or set of columns is appropriate to serve as a foreign key between the selected tables.
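Purely as an illustration of the question a Candidate Key profile answers, the following T-SQL sketch checks a hypothetical CustomerCode column in dbo.Customers for duplicates; any rows returned mean the column is not a good key candidate.

-- Find values of the candidate key column that occur more than once.
SELECT CustomerCode, COUNT(*) AS DuplicateCount
FROM dbo.Customers
GROUP BY CustomerCode
HAVING COUNT(*) > 1;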

Integration Services Objects

The following objects are involved in an Integration Services package:
- Integration Services package: The top-level object in the SSIS component hierarchy. All the work performed by SSIS tasks occurs within the context of a package.
- Control flow: Helps to build the workflow in an ordered sequence using containers, tasks, and precedence constraints. Containers provide structure to the package and a looping facility, tasks provide functionality, and precedence constraints build an ordered workflow by connecting containers, tasks, and other executables in an orderly fashion.
- Data flow: Helps to build the data movement and transformations in a package using data adapters and transformations in ordered sequential paths.
- Connection managers: Handle all the connectivity needs.
- Integration Services variables: Help to reuse or pass values between objects and provide a facility to derive values dynamically at run time.
- Integration Services event handlers: Help extend package functionality using events occurring at run time.
- Integration Services log providers: Help capture information when log-enabled events occur at run time.
