Step-By-Step Tutorial - ExcelPowerQuery

The document provides instructions for students to extract, transform, and load financial data from SEC filings for analysis in Tableau. It outlines downloading an Excel file with company information, as well as SEC data files. It describes opening and exploring the files, then using Power Query in Excel to clean the data by removing unnecessary fields, changing data types, filtering to specific periods and forms, and transforming numeric fields to dates. The goal is to join the cleaned SEC data with the company information file for analysis in Tableau.


Student Instructions for the ViewDrive Analytics Case for Tableau Prep and Tableau Desktop
I. Extracting Data: The first step in the Extract, Transform, and Load (ETL) process
(using Microsoft Excel and Tableau Prep)

A. Preparing the XBRL_SOFT_EXT flat file

1. Locate and download the Excel file you received from your controller that contains XBRL
software and extensions information.

2. Open the file and explore the fields.

3. Note that the field titled [CIK] is the unique identifier for our company list. Other fields to note
include industry, the type of form filed (we use 10-Ks in this example), date filed, XBRL creation
software, and extension percentages.

4. You may notice that the [DATE_FILED] field has numbers instead of dates. Highlight this
column and change the formatting of the cells. Either right-click or click on the number
format button on the menu ribbon and select the SHORT DATE format (MM/DD/YYYY).
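If the numbers in [DATE_FILED] are Excel date serials (days elapsed since Excel's epoch), reformatting the cells is all that is needed. For readers who want to check a value outside Excel, here is a small Python sketch; the serial number shown is illustrative, not a value taken from the actual file:

```python
from datetime import date, timedelta

def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel date serial (days since 1899-12-30) to a date."""
    # Excel's epoch is offset by its historical 1900 leap-year bug,
    # so 1899-12-30 is the correct base for modern serial numbers.
    return date(1899, 12, 30) + timedelta(days=serial)

# Illustrative serial number, not a value from the actual file:
print(excel_serial_to_date(43831))  # 2020-01-01
```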

5. Save this file and close Excel.

B. Obtaining the SEC financial statement database (tagged)

1. Navigate to https://www.sec.gov/dera/data/financial-statement-data-sets.html. Locate and
download the YEAR-Quarter combination that your instructor has specified for you to use
(for example, 2020q1).

2. The quarter indicates dates of filing by companies. For example, for a company with a fiscal year-
end of December 25, 2016, the financial close and audit can take a couple of months: the filing for
this company is located in the 2017 Q1 download. Thus, the 2020 Q1 download should include
companies with a fiscal year-end of December 2019.

3. Right-click on the downloaded zipped file (which is likely in your Downloads folder) and
choose the “Extract All...” option (sometimes located under Send to) to unzip the file contents
into a separate (new) folder.

4. Open the “readme” file in your newly created folder and glance through the document to gain
an understanding of the file contents and field definitions. Obtaining a thorough understanding of
your data is a critical first step in performing data analytics. Leave this important reference
open as you work through the remaining data extraction and transformation steps in this
tutorial.

5. The extracted folder contains multiple large text (.txt) files with field (cell) values separated by tabs.
Given the size of these files, it is recommended to open the (.txt) files from within Excel or
another data management tool (do not double-click them, or they will open directly in Notepad).
You may do this by right-clicking a (.txt) file >> “Open with” >> program of your choice. In this
case, select Excel to open these types of files (Tip: remember to uncheck “always use this
application for these types of files”).
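Because the files are tab-delimited, they can also be previewed programmatically. The following Python sketch is an optional aid, not part of the tutorial's Power Query workflow; the path passed in is whatever your extracted folder contains:

```python
import csv

def preview_tab_file(path, n_rows=1):
    """Return the header and the first n_rows of a tab-delimited file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # zip with a bounded range so we never load the whole (large) file
        rows = [row for _, row in zip(range(n_rows), reader)]
    return reader.fieldnames, rows
```

For example, `preview_tab_file("sub.txt")` (an assumed path) returns the field names and the first data row without opening the file in Excel.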

6. As you read the data definitions of each table (file) in the “readme.htm” file, notice that [ADSH]
and [CIK] are used as unique identifiers for filing companies across the documents contained.
Both of these fields appear in the “SUB.TXT” file (a file with submission details such as fiscal
period ([fp]) and date of filing); however, the “NUM.TXT” file (a file that contains all the financial
statement tagged data, e.g., Assets, Liabilities, Cash, etc.) only uses the [ADSH] field as a
unique identifier. Recall that [CIK] was also a unique identifier in the file you received from the
controller with XBRL software and extension information (the “XBRL_SOFT_EXT” flat file).

7. Note that we are interested in identifying the size of companies (your instructor may determine
how you define company size; a good measure to use is “Assets”). Such numeric-tagged XBRL data
about companies’ financial position is contained in the “NUM” file, so we will later need to join the
“NUM” file with the “SUB” file to get the company identifiers ([CIK]) needed to combine the
retrieved SEC financial statement data with the “XBRL_SOFT_EXT” flat file provided by the
controller (Tip: [CIK] and [filing date] will be used to join this SEC company size data to the
software type and extension data in a later step).

8. You may open Microsoft Excel or other advanced text-editing software such as Notepad++ and
explore the “SUB.TXT” and “NUM.TXT” files. Note that certain “NUM.TXT” files may be too
large to open in Excel (they may be truncated to the maximum row limit of 1,048,576 rows or fail to
open). If the file is too big for Excel but does not exceed 2 GB, MS Access is a capable and
widely available alternative for previewing the data.

C. Finalizing the Extraction Process


In the next section, we will use an ETL tool (Excel Power Query) to continue to extract and
transform (clean and combine) our files: “SUB.TXT”, “NUM.TXT”, and
“XBRL_SOFT_EXT.XLSX”. To jointly perform the extraction and transformation processes,
you need to open Excel and use the Power Query method of connecting to (extracting) data.

1. In Excel >> Click on the DATA tab on the ribbon >> Click on GET DATA >> From File >>
From Text/CSV. You will use the above procedure to open (extract) “SUB.TXT” and
“NUM.TXT” files since they are in text format.

2. For the “XBRL_SOFT_EXT.XLSX” file, you will need to click on GET DATA >> From File >>
“From Workbook” instead.

3. You will see a preview of the table that is being extracted inside Excel Power Query.

Since some files are too large for Excel, we will transform each file while connecting to
(extracting) it.

In the next section, we will use Excel Power Query to connect to and transform (clean and
combine) each of our files: “SUB.TXT”, “NUM.TXT”, and “XBRL_SOFT_EXT.XLSX”.

II. Transforming (cleaning and joining) data: The second step in the Extract, Transform
and Load (ETL) process (using Excel Power Query)

A. Connecting to and cleaning the “SUB.TXT” file

1. In Excel >> Click on the DATA tab on the ribbon >> Click on GET DATA >> From File >>
From Text/CSV. Open the “SUB.TXT” file.

2. Since we want to transform the file before loading it into Excel, click on “Transform Data” to
begin the cleaning process inside the Power Query Editor.

3. The field names, identified data types, and sample field values of the “SUB” file will appear on the
column headings as well as the formula bar.

4. Review the “readme” file that was part of the SEC zip file you previously extracted to
understand the fields in the “SUB” file. Remove fields (columns) that we will surely not be
utilizing, specifically [countryma], [stprma], [cityma], [zipma], [mas1], [mas2], [detail], and
[ackis]. Identify other fields that may safely be excluded and remove them, such as the second
street address field [bas2], phone number [baph], and former company names [former].

5. Check the data types Excel automatically applied to ensure that they match the purpose of your
analysis. Three categories of data are worth noting when using Tableau: Numeric (used as
measures), String (used as dimensions), and Date (used as dimensions). Numeric is used when
we want Tableau Desktop to be able to perform calculations on a given field, such as sums (totals)
and averages (e.g., Assets, Revenues, and Net Income).

6. Notice that [fy] and [fye] are fields specifying the fiscal year and fiscal year-end (refer to the data
definition in the “readme.htm” file); thus, we will likely use these to categorize and group data by
period rather than to perform calculations such as averages, as it would not make sense to sum the
[fiscal year] field. Change the data types of [fy] and [fye] to Text by right-clicking on the
column heading and selecting “Change Type” (or clicking on “123” to the left of the column
name and selecting ‘Abc Text’).

7. In the [form] field, filter your data to retain the records related to annual SEC filings. To apply the
filter in the field, click on the dropdown at the right of the column heading, uncheck “Select
All”, and select only “10-K”. Alternatively, you can click on “Text Filters” >> “Equals” and type
in “10-K”.

8. In the [fp] field, filter your data to show annual submission files by setting fiscal period ([fp]) =
“FY”.
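Steps 7 and 8 together keep only annual 10-K submissions. For intuition, the combined filter logic can be sketched in Python, with rows represented as dictionaries and field names taken from the SUB file; the sample data is made up:

```python
def keep_annual_10ks(rows):
    """Retain only rows filed as 10-K forms covering a full fiscal year (FY)."""
    return [r for r in rows if r["form"] == "10-K" and r["fp"] == "FY"]

sample = [
    {"adsh": "a1", "form": "10-K", "fp": "FY"},
    {"adsh": "a2", "form": "10-Q", "fp": "Q1"},
]
print(keep_annual_10ks(sample))  # only the first row survives both filters
```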

9. Notice that the [period] field containing the fiscal year ending date is a numeric field. This field
needs to be transformed so that the values are Excel-recognizable dates. This is done in two steps:
(1) change the data type to “text”, and (2) change the data type to “date.” In these steps, select
“Add new step” instead of “Replace current” (if you choose “Replace current” it replaces step
#1 where you changed the numeric field to a text field). The [period] field should now show dates
in the format MM/DD/YYYY.
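The two-step conversion is needed because [period] stores dates as YYYYMMDD numbers, which a direct numeric-to-date change would misread. The same idea in a Python sketch (illustrative, outside the Power Query workflow):

```python
from datetime import datetime, date

def yyyymmdd_to_date(value: int) -> date:
    """Mirror the two Power Query steps: numeric -> text -> date."""
    text = str(value)                                # step 1: numeric to text
    return datetime.strptime(text, "%Y%m%d").date()  # step 2: text to date

print(yyyymmdd_to_date(20191231))  # 2019-12-31
```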

B. Connecting to and Cleaning the “NUM” file

1. In the Power Query editor, from the “Home” menu, click on New Source >> File >> Text/CSV to
locate and bring in the “NUM.TXT” file. Perform the following steps on the “NUM” file.
IMPORTANT: the NUM.TXT file is extremely large, and each step below may take a while to
execute depending on the memory capacity and processor speed of your computer. Be patient!
Also note that because the file is large, filter values may need to be entered manually, since not
all possible values will be available for selection.

2. Remove the [footnote] column since we will not use footnote data for the analysis.

3. In the [qtrs] field, apply a “Number Filter” to keep only values of 0 OR 4. As indicated in the
“readme.htm” file, “0” indicates a point-in-time value (i.e., numbers on the balance sheet), whereas
“4” indicates a duration value covering four quarters (i.e., full-year income statement amounts).

4. In the [value] field, apply a filter to exclude “null” (empty) values by clicking on “Remove
Empty”.
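For intuition, the two NUM filters above (keeping qtrs of 0 or 4, and keeping only non-empty values) can be sketched together in Python; field names follow the NUM file and the sample rows are made up:

```python
def clean_num_rows(rows):
    """Keep point-in-time (qtrs=0) and four-quarter (qtrs=4) rows
    that actually carry a numeric value."""
    return [r for r in rows if r["qtrs"] in (0, 4) and r["value"] is not None]

sample = [
    {"tag": "Assets", "qtrs": 0, "value": 500},     # kept: balance sheet item
    {"tag": "Revenues", "qtrs": 1, "value": 40},    # dropped: single-quarter duration
    {"tag": "Revenues", "qtrs": 4, "value": None},  # dropped: empty value
]
print(clean_num_rows(sample))  # only the Assets row remains
```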

5. (Optional) Note that XBRL has enabled companies to tag all the financial data reported to the SEC.
Since filers are supposed to follow the standardized XBRL taxonomy for US reporting (as best they
can), the [version] field shows which taxonomy version was used in each filing.
a. Right-click [version] >> “Group By…” >> change the “New column name” to
“Count_Versions” and click OK.
b. Sort the Count column in Descending Order. Which taxonomy/year version is mostly
used in your dataset? ______________________
c. Delete steps a & b (“Grouped Rows” & “Sorted Rows”) from above.
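The group-and-count operation in steps a and b has a simple counterpart in Python's collections.Counter; this sketch only illustrates the idea, and the sample version strings are made up:

```python
from collections import Counter

def most_common_version(rows):
    """Group rows by [version], count them, and return the top (version, count)."""
    counts = Counter(r["version"] for r in rows)
    return counts.most_common(1)[0]

sample = [{"version": "us-gaap/2019"}] * 3 + [{"version": "us-gaap/2018"}]
print(most_common_version(sample))  # ('us-gaap/2019', 3)
```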

6. (Optional) Note that the [tag] field has multiple tags (line item identifiers such as Assets and
Revenues). Use the grouping and sorting steps on the [tag] field to identify the most common line
item names used by companies in their financial statements.
a. Right-click [tag] >> “Group By…” >> change the “New column name” to
“Count_Tags” and click OK.
b. Sort the Count column in Descending Order. What is the most commonly used tag
(line item) amongst public companies and how many times was it used?
_____________ , _________
c. How popular are the “Assets” and “NetIncomeLoss” tags according to the preview (i.e.
how many rows of data contain each of these tags in this dataset)? “Assets” ________ ,
“NetIncomeLoss” _________
d. Delete steps a & b (“Grouped Rows” & “Sorted Rows”) from above.

7. In the [tag] field, you will need to apply text filters to retain data related to particular financial
statement line items. The tags to be selected are Assets and NetIncomeLoss. Note that your
instructor may ask you to use other tags such as “StockholdersEquity” and
“EarningsPerShareDiluted”. If at a later point, you need to extract additional line items (i.e.,
tags) for your analysis, you can return to this step, click on the widget to the right of the filter
procedure, and modify this text filter to include additional tags. Click on the field drop down for
[tag] >> select “text filter” >> “equals”, then use the “Advanced” option to specify the tag
names to be selected in a series of “or” filters. Your filters should appear as shown in the
following screenshot.

8. Use the procedure noted for the SUB.TXT file in Step II (A) [9] above to change the [ddate] field
to a date field.

9. Apply a filter to the [ddate] field to select only dates between 1/1/2019 and 1/1/2020.

10. The data may have duplicates, since companies may have multiple registrations with the SEC if
they go through mergers and acquisitions. To indicate whether a registrant’s financial statements
relate to a subsidiary, the “NUM” file contains a co-registrant field named [coreg].

11. To eliminate duplicate records due to subsidiaries, locate the [coreg] field and apply a filter
so that only rows that are blank (have null values) are retained.

C. Joining the NUM file with the SUB file

1. In the Power Query editor “Home” menu, choose “Merge Queries” >> “Merge queries as new” to
join the “SUB” and “NUM” files. The two files are to be joined on two fields: [adsh] in both
files, and [ddate] in “NUM” matched with [period] in “SUB”. To select multiple fields, hold
down the Ctrl key as you click on each field (column) in each file in the “merge as new” window.

2. You will see number 1 and number 2 next to the columns in both files, indicating which field in
each file is being matched with the corresponding field in the other file. Select “inner join” so
that the result only contains matching rows from both files. (Tip: for the merge to work
properly, the data type of each field should match across the two files).
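Conceptually, this merge is an inner join on a composite key. The following Python sketch mirrors the logic, including Power Query's default behavior of prefixing second-file column names; field names follow the tutorial and the data is illustrative:

```python
def inner_join_num_sub(num_rows, sub_rows):
    """Inner-join NUM and SUB on the composite key:
    NUM.[adsh] = SUB.[adsh] and NUM.[ddate] = SUB.[period]."""
    sub_index = {(s["adsh"], s["period"]): s for s in sub_rows}
    joined = []
    for n in num_rows:
        match = sub_index.get((n["adsh"], n["ddate"]))
        if match is not None:  # inner join: keep matching rows only
            # Prefix SUB fields, mirroring Power Query's default column naming.
            joined.append({**n, **{"sub." + k: v for k, v in match.items()}})
    return joined
```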

3. Remove the [qtrs] field, as it is not needed and will result in multiple rows for firms that use both 0
(to signify balance sheet items) and 4 (to signify fourth quarter income statement items).

4. Scroll to the right of the newly merged “query”. Depending on which file you picked as the first file
and which as the second, you will see the second file appear as the last column in Power Query
editor. However, the fields in the second file are not visible. As shown in the following screenshot,
click the circled icon to “expand” the file so that the columns appear (in my case the “sub” file was
the second file). When you click the “expand” icon you will see the fields from the second file.
Note that Power Query by default will attach the file name as a prefix in naming all of the columns
in the second file. The prefix can be left as-is, to easily distinguish which fields came from the
second file. Click “OK” to complete the expand operation.

5. Next, we will perform a transformation such that each [tag] will become its own column with
[value] populated in the respective columns. We call this procedure a Pivot of rows to columns.

6. Select the [tag] column and then click “Pivot Column” in the “Transform” menu on the
ribbon. Choose the [value] column with the numeric values as the “Values Column.” Under
“Advanced options,” in the “Aggregate Value Function” dropdown, select “Don’t aggregate”
since we want the individual numbers without any aggregation. After this step, you should see
a new column for each tag: Assets, NetIncomeLoss, and any other tags your instructor has
specified. You will note that there are many null values, which means that the particular line item
was not reported by the firm in that specific 10-K filing.
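The pivot reshapes the data from one row per (filing, tag) pair to one row per filing with a column per tag, and the nulls arise exactly where a filing never reported a given tag. A Python sketch of the same reshaping, with illustrative data:

```python
def pivot_tags(rows):
    """Pivot long (adsh, tag, value) rows so each tag becomes its own column,
    with no aggregation: one output row per filing (adsh)."""
    result = {}
    for r in rows:
        result.setdefault(r["adsh"], {"adsh": r["adsh"]})[r["tag"]] = r["value"]
    return list(result.values())

rows = [
    {"adsh": "a1", "tag": "Assets", "value": 100},
    {"adsh": "a1", "tag": "NetIncomeLoss", "value": 10},
    {"adsh": "a2", "tag": "Assets", "value": 55},  # a2 never reported NetIncomeLoss
]
print(pivot_tags(rows))
```

In this sketch a missing tag is simply an absent key, which plays the role of the null values Power Query shows.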

7. In the expanded columns for the SUB.TXT file, check the data type of the [filed] column (it will
probably be named [sub.filed]). If it is a numeric field instead of a date field, use the procedure
described in step II [A] (9) earlier to convert it to a date field.

8. Before moving on, rename the query from “Merge1” to something more meaningful, e.g.,
“SUB_NUM.”

D. Connecting to XBRL_SOFT_EXT.xlsx file and joining it with SUB_NUM from above

1. We now need to bring in the XBRL_SOFT_EXT.xlsx file from step I [A] and merge it with the
joined NUM and SUB file (SUB_NUM). From the “Home” tab in Power Query, click “New
Source” at the far right and select File >> Excel. Then navigate to your folder containing the
XBRL_SOFT_EXT.xlsx file and select the file. Within the file, select the sheet that has the
relevant data. Power Query should automatically promote the headers and change data types as
necessary.

2. If the sheet does not have a meaningful name, use the properties section on the right of the
screen to rename it to “fiscal year 10-Ks”.

3. In the Power Query editor “Home” menu, choose “Merge Queries” and “Merge queries as
new” to join the “fiscal year 10-Ks” with the previous “SUB_NUM” query. The two files are to
be joined on two fields (i.e., a composite key): [CIK] and [filed] = [Date Filed]. As before, to
select multiple fields, hold down the Ctrl key as you click on each field (column) in each file
in the “merge as new” window. You will see number 1 and number 2 next to the columns in
both files, indicating which field in each file is being matched with the corresponding field in the
other file. Select “inner join” so that the result only contains matching rows from both files.
(Make sure the data types for each field match across the two files.)

4. Repeat the procedure noted in step II [C](4) above to “expand” the columns in the “fiscal
year 10-Ks” file as shown below.

5. This time, uncheck “Use original column name as prefix” to avoid lengthy field names (the
“sub.” prefix from the earlier merge already identifies where those fields came from).

6. Rename the final merged file to “SEC XBRL SUB_NUM merged.” The final merged file
should have around 34 columns (assuming you performed all of the steps noted above, including
removing specific columns, with no additional “tags” included by your instructor).

7. The final step in Power Query is to use the “Close & Load” feature to load all of the data
into an Excel spreadsheet.

8. You are now finished with the transformation step of the ETL process. The “SEC XBRL
SUB_NUM merged” sheet in the resulting Excel file is now ready for analysis in Tableau. You can
name the Excel file “XBRL software extensions with SEC data.xlsx” and save this cleaned file
on your local drive. This file should contain all sheets and the Power Query queries performed
above during the ETL process.

III. Loading (Opening) transformed dataset: The third step in the Extract, Transform and Load
(ETL) process (using Tableau Desktop)

A. Load the data into your analytics software (Tableau Desktop)

1. Open Tableau Desktop and make a “connection” to the Excel file “XBRL software extensions
with SEC data.xlsx.” You only need to bring in the “SEC XBRL SUB_NUM merged” table as the
data source in Tableau.

B. Making additional changes after “Loading” your dataset

1. Determining Industry Categories based on 2-digit Standard Industry Classification (SIC)


Codes
Most industries have more detailed sub-categories that may not be informative to examine at the
granular level. Therefore, follow this procedure to determine the 2-digit SIC Code from the
existing SIC field in your data in Tableau Desktop.
a. Click on the dropdown to the right of the field header for [SIC] and select “Create
Calculated Field”. Name the field “SIC-2digit.”

b. Enter the following formula [SIC]/100 and click OK.

c. A new calculated field is created showing the 4-digit code as a 2-digit code with two decimal
places. Click on the “#” icon for the newly created field and change the data type to
Number (whole).
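The intent of [SIC]/100 followed by a whole-number type is to drop the last two digits of the 4-digit code. In Python, integer division expresses this directly; the example code 7372 is illustrative:

```python
def sic_2digit(sic: int) -> int:
    """Collapse a 4-digit SIC code to its 2-digit major group
    by dropping the last two digits (integer division truncates)."""
    return sic // 100

print(sic_2digit(7372))  # 73 (Prepackaged Software falls under Business Services)
```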

2. Creating an industry description field using the group option


Now that you have a more meaningful 2-digit classification, you can perform your analysis on the
first two digits of ViewDrive’s SIC code. However, it may be useful to have descriptive labels for
these 2-digit SIC codes so you can tell which industries they represent without having to look up each
code. Visit https://siccode.com/sic-code-lookup-directory to see the detailed description of the
major industry categories by 2-digit SIC code.
a. Although the 2-digit SIC codes represent a more general industry category for the companies in
your dataset, let us add the description of the industries using this breakdown.

b. Click on the dropdown of the [SIC-2digit] field and click on “Create groups…”

c. Rename the field to [Industry] and use the SHIFT key to select 1 through 9 and click on
“GROUP”. Type “Agriculture, Forestry, Fishing” to name your first group.

d. Select 10 through 14 and create another group. Type “Mining” as the name for this
group.

e. Repeat these steps for each of the SIC Code Categories until you have finished grouping
the codes up to the last, “Public Administration,” category. Note: you may not have 2-digit
SIC codes for 01 through 99, so just create groups for the 2-digit SIC codes in your data set.
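The manual grouping amounts to a range-to-label lookup. A Python sketch of the mapping, showing only the two groups named above (the remaining ranges come from the siccode.com directory):

```python
def industry_label(sic2: int) -> str:
    """Map a 2-digit SIC code to a broad industry label.
    Only the two groups named in the tutorial are shown here."""
    if 1 <= sic2 <= 9:
        return "Agriculture, Forestry, Fishing"
    if 10 <= sic2 <= 14:
        return "Mining"
    return "Other"  # remaining ranges omitted in this sketch

print(industry_label(12))  # Mining
```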

Analysis (in Tableau Desktop)
Before you begin, you may want to watch some tutorial videos on Tableau Desktop by following the link
below:
https://www.tableau.com/learn/training

1. Dimensions and Measures: Measures are numeric values that we intend to perform calculations
and arithmetic on. On the other hand, dimensions allow us to group the calculations and can
become column and row labels.

2. Make appropriate changes to any fields whose data type Tableau identified incorrectly.
For example, numbers that are unique codes, such as [CIK], should be categorized as dimensions.
Confirm that [Assets] and [Percent extended] are classified as measures.

3. Zipcode could be classified as a dimension. Further, you may classify it as a geographic field by
right-clicking on it and selecting Geographic Role. This will enable you to draw maps as shown in
the visualization below.

4. Notice the “Show Me” option in the top right corner of the screen.

5. Selecting multiple fields of interest (using CTRL) on the left and clicking on “show me” results in
Tableau suggesting visualizations that it thinks are appropriate by greying out the rest.

6. Notice that Tableau resembles PivotTables in Excel where you can drag and drop fields into the
layout to summarize and aggregate data (rows, columns, filters, and values).

7. Start by creating descriptive statistics to familiarize yourself with the data. A sample visualization
utilizing the count of unique companies (count of CIK), an average of Assets, as well as average
percent extended grouped by XBRL software is shown below in a tabulated format. This option is
the top left option from the “show me” visualization options.

Please note that the table below is a sample and not representative of the actual data that your instructor
will assign. (These statistics should be what you get for the 2020q1 SEC data set).

8. Average, Std Dev, Median, Min, Max can optionally be calculated on the same table by dragging
the variable repeatedly into the canvas and changing the aggregation from sum to the intended one.

9. Calculated variables and bins (groupings) can also be created from fields to provide insights.

10. Note that multiple visualizations can be created on separate “pages” or “sheets.” As a data analyst,
you will need to determine the best way (sequence) to tell your story. You will need to decide how
to narrow your scope in subsequent visualizations (similar to how one may have multiple sheets in
an Excel workbook) with different levels of abstraction. Explore and learn about Tableau’s
dashboard and stories (watch videos for examples of excellence) to further refine the order and
presentation of your visualizations.

11. As you start building your visualizations, think about your presentation of the material. Should we
be looking at trends by company size, industry, or extensions first?

12. When finished, save your Tableau file as <<YourLastName_YourFirstName>> using the “.twbx”
extension. You will need to submit this file to your instructor along with your report.

Reporting
Executive memo report (output of a consulting/advisory report)
Create an executive memo addressed to the controller of ViewDrive Corporation. Present clear and
concise information about the problem you are analyzing. Subheadings should include:
– Introduction
– Main points
– Conclusions and recommendations (You may use bullet points to highlight important points).

The visualizations in the appendix should have a logical flow and be referred to in the text of the
executive memo. Refer to the case requirements section for a detailed description of deliverables and
formatting.
