Performance of Fastload Vs Multiload
This is my understanding of the different types of checkpointing taking place and the utility
differences...
During Phase I of Fastload, the rows are deblocked and redistributed to their proper AMPs,
but they are not sorted in hashing order. In Phase II of Fastload, the rows are then sorted
and merged into the actual table. There is some kind of internal checkpointing going on in this
phase to keep track of which data blocks have been merged in and which have not. This is
different from the checkpointing that takes place in Phase I, which keeps track of which rows
have been loaded from the host file. At the end of Phase II of Fastload, the fallback copy of
the table is created (if fallback is specified).
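As a rough sketch (the names and checkpoint interval are illustrative, not from any particular system), a minimal Fastload script looks something like this. Phase I covers the data acquisition after BEGIN LOADING, and END LOADING is what triggers the Phase II sort and merge:

    LOGON tdpid/user,password;
    SET RECORD VARTEXT ",";
    BEGIN LOADING mydb.target_table
        ERRORFILES mydb.target_err1, mydb.target_err2
        CHECKPOINT 100000;  /* Phase I checkpoint every 100,000 rows */
    DEFINE col1 (VARCHAR(10)),
           col2 (VARCHAR(20))
    FILE = load_data.txt;
    INSERT INTO mydb.target_table VALUES (:col1, :col2);
    END LOADING;            /* triggers the Phase II sort and merge */
    LOGOFF;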
In Mload, during the acquisition phase, if you have fallback on the table, two copies of each row
are created and sent to the AMPs (one to the primary AMP and one to the fallback AMP). This is one
difference between Mload and Fastload. Also, during the acquisition phase, as rows are being
deblocked and sent to the appropriate AMP, they are put into the work table in hashing order, just
as they exist in the target table. So, at the end of the acquisition phase, there is no need for a
sort (another difference between Mload and Fastload).
Once all of the rows have been put into the work table, the application phase begins and
the rows are merged from the work table into the target table. There is also checkpointing in
the application phase to keep track of which rows (or data blocks) from the work
table have been applied to the target table and which have not. This is different from the
checkpointing in the acquisition phase, which, again, keeps track of which rows have
been loaded from the file.
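For comparison, a minimal Mload script (again with illustrative names) shows the same two-phase flow: .IMPORT drives the acquisition phase into the work table, and .END MLOAD triggers the application phase. The .LOGTABLE is what Mload uses for its checkpoint/restart bookkeeping:

    .LOGTABLE mydb.target_restartlog;
    .LOGON tdpid/user,password;
    .BEGIN MLOAD TABLES mydb.target_table;
    .LAYOUT file_layout;
    .FIELD col1 * VARCHAR(10);
    .FIELD col2 * VARCHAR(20);
    .DML LABEL ins_target;
    INSERT INTO mydb.target_table VALUES (:col1, :col2);
    .IMPORT INFILE load_data.txt
        FORMAT VARTEXT ','
        LAYOUT file_layout
        APPLY ins_target;   /* acquisition phase: rows land in the work table */
    .END MLOAD;             /* triggers the application phase */
    .LOGOFF;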
So, the differences between the two utilities when loading into an empty table would be:
1) Fastload sends one copy of each row regardless of fallback or non-fallback and then
creates the fallback copy in Phase II. Mload creates a second copy of each row if the target
table is defined as fallback. (Of course, you can always alter the table to add fallback at the
end of the Mload or Fastload, which would make this difference moot; see the statement after
this list.)
2) Fastload does a sort in Phase II to get the rows in hashing order. Mload puts the rows in
hashing order in the work table as it goes along. So, there is no sort that takes place in Mload
prior to merging the rows into the target table.
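The fallback workaround mentioned in point 1 is a single statement (table name illustrative):

    ALTER TABLE mydb.target_table, FALLBACK;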
If you are going into a non-fallback table, I would think that #2 would be the primary advantage
of using Fastload. Rather than going through the overhead of keeping the rows in the work table
in order as it goes along (which slows things down), Fastload simply gets the rows to the right
AMP (in Phase I) and then sorts them at the end of that process (during Phase II).
a1) BTEQ returns rows from the database to the underlying CLI in blocks as well. Depending on
your setup, 64K may be the default, but I believe this can be adjusted upward to 1MB (a CLI
parameter, not a BTEQ parameter). This should be changed if exporting medium to large result sets.
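For reference, a minimal BTEQ export script (names illustrative). Note the block size is adjusted on the CLI side, if I remember correctly via the resp_buf_len setting in clispb.dat, not by anything in the script itself:

    .LOGON tdpid/user,password
    .EXPORT DATA FILE = results.dat
    SELECT col1, col2 FROM mydb.source_table;
    .EXPORT RESET
    .LOGOFF
    .QUIT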
a2) The primary difference between Fastexport and BTEQ export is the ability to ship data over
multiple session connections simultaneously, thereby leveraging the total connectivity available
between the client platform and the database engine. To do this, Fastexport spends more
resources on executing the query, preparing the blocks in such a way that, when they are
exported over multiple sessions, they can easily be reassembled in the right order by the client
without additional sorting or processing of the rows.
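A minimal Fastexport script makes the multi-session point concrete; the session count and names are illustrative:

    .LOGTABLE mydb.fexp_restartlog;
    .LOGON tdpid/user,password;
    .BEGIN EXPORT SESSIONS 8;   /* ship blocks over 8 parallel sessions */
    .EXPORT OUTFILE results.dat;
    SELECT col1, col2 FROM mydb.source_table;
    .END EXPORT;
    .LOGOFF;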
b1) BTEQ import does process a row at a time. Thus it is generally appropriate only for very small
imports.
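Something like this (names illustrative); .REPEAT * submits the INSERT once per record in the input file, which is exactly the row-at-a-time behavior described above (newer BTEQ versions can batch rows with a PACK clause on .REPEAT, which softens this somewhat):

    .LOGON tdpid/user,password
    .IMPORT DATA FILE = small_input.dat
    .REPEAT *
    USING (col1 INTEGER, col2 CHAR(10))
    INSERT INTO mydb.small_table VALUES (:col1, :col2);
    .LOGOFF
    .QUIT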
b2) Fastload does use a buffered approach. But its real differentiation comes from its use of multiple
simultaneous sessions which allow it to leverage all of the parallelism of the parallel database engine.
Multiple AMPs are working on receiving, deblocking, data type transformation, hashing and
redistribution of the data.
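In script terms, that parallelism is requested up front with the SESSIONS command (the count here is arbitrary; as I understand it, Fastload is capped at one session per AMP):

    SESSIONS 16;
    LOGON tdpid/user,password;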
b3) Fastload is only capable of inserting data into an empty table. This obviates the need for
journaling because, if the load doesn't finish properly, the proper recovery is to simply delete the
table and start over. BTEQ import uses journaling because it can load into an existing table and
needs to be able to roll back an unfinished import.
c1) Fastload only allows insert into empty tables. Thus the thought is that it is easy to add indexes and
other table attributes after the table has been loaded with the data. No need for journals because the
data is either loaded or it is not. No need to roll forward or back.
c2) The primary reason Fastload does not allow duplicate records is that, at the time it was designed
and built, multiset tables did not exist. Teradata's early days allowed only SET tables, so Fastload
did not have to be designed to load duplicate records. It was deemed a user-friendly feature
to automatically eliminate them rather than report them as errors. When Multiload was designed, it
had a very different set of requirements, including multiset support and the ability to insert, update
and delete in existing populated tables. This in turn led to a very different design, including the
sequence numbers in the incoming rows to allow ordered apply, insert of duplicate rows, and the
checkpoint/restart case while still preserving the duplicate rows. It was decided that we would not
go back and redesign Fastload to cover all of these cases but rather leave it for the simple case of
insert into empty tables and have Multiload handle any other case not supported by Fastload.
Choosing between the bulk tools should be less about relative performance and more about matching
the required functionality to the use case. If inserting data into an empty table (without dups),
then Fastload; else Multiload. BTEQ import and TPump are chosen if the import data is small or
medium, respectively, and it is desirable to avoid the overhead of using a bulk utility slot and the
overhead of startup and shutdown of the bulk utility for small data sets. BTEQ export likewise is
appropriate for small to medium result sets. Fastexport is designed to handle the high-volume
exports in an expeditious way. And of course TPump can be used when the requirement is to load
continuously.