Unstructured Data: User Price Shipped
SQL (Structured Query Language) is a standard language for data operations that lets you ask
questions and get insights from structured datasets. It's commonly used in database management
and supports tasks ranging from writing transaction records into relational databases to
petabyte-scale data analysis.
Structured datasets have clear rules and formatting and are often organized into tables, that is,
data formatted in rows and columns.
An example of unstructured data would be an image file. Unstructured data cannot be queried with SQL
and cannot be stored in BigQuery datasets or tables (at least not natively). To work with image data,
for instance, you would use a service like Cloud Vision, perhaps through its API directly.
User   Price  Shipped
Rocky  $50    No
If you've had experience with Google Sheets, then the above should look quite similar. The table has
columns for User, Price, and Shipped and two rows that are composed of filled in column values.
SQL reads much like plain English, so before running a query it helps to first figure out what
question you want to ask of your data (unless you're just exploring for fun).
SQL has predefined keywords that you use to translate your question into pseudo-English SQL
syntax, so that the database engine returns the answer you want.
Use SELECT to specify what fields you want to pull from your dataset.
Use FROM to specify what table or tables you want to pull your data from.
An example may help understanding. Assume that you have the following table example_table,
which has columns USER, PRICE, and SHIPPED:
Say that you want to pull only the data found in the USER column. You can do this by
running the following query, which uses SELECT and FROM:
SELECT USER FROM example_table;
If you executed the above command, you would select all the names from the USER column that are
found in example_table.
You can also select multiple columns with the SQL SELECT keyword. Say that you want to pull the
data found in the USER and SHIPPED columns. To do this, modify the previous query by adding
another column name to the SELECT clause (making sure it's separated by a comma!):
SELECT USER, SHIPPED FROM example_table;
Running the above retrieves the USER and SHIPPED data from example_table.
WHERE
The WHERE keyword filters a table's rows for specific column values. Say that
you want to pull the names from example_table whose packages were shipped. You can supplement
the query with a WHERE clause, like the following:
SELECT USER FROM example_table WHERE SHIPPED = 'Yes';
Running the above returns every USER whose package has been SHIPPED.
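The three query shapes described so far can be tried locally before moving to BigQuery. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database, not BigQuery itself. The Rocky row comes from the table shown earlier; the second row is hypothetical and exists only so the WHERE filter has something to match.

```python
import sqlite3

# In-memory SQLite database standing in for example_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (USER TEXT, PRICE TEXT, SHIPPED TEXT)")
conn.executemany(
    "INSERT INTO example_table VALUES (?, ?, ?)",
    [
        ("Rocky", "$50", "No"),   # row shown in the text
        ("Avery", "$35", "Yes"),  # hypothetical row, added for illustration
    ],
)

# SELECT ... FROM: pull a single column.
users = [row[0] for row in conn.execute("SELECT USER FROM example_table")]

# SELECT with two comma-separated columns.
user_shipped = conn.execute("SELECT USER, SHIPPED FROM example_table").fetchall()

# WHERE: keep only the rows whose SHIPPED value is 'Yes'.
shipped_users = [
    row[0]
    for row in conn.execute("SELECT USER FROM example_table WHERE SHIPPED = 'Yes'")
]

print(users)
print(user_shipped)
print(shipped_users)
```

Each query returns a list of rows; selecting fewer columns narrows each row, while WHERE reduces the number of rows.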
Now that you have a baseline understanding of SQL's core keywords, apply what you've learned by
running these types of queries in the BigQuery console.
BigQuery is a fully managed, petabyte-scale data warehouse that runs on Google Cloud. Data
analysts and data scientists can quickly query and filter large datasets, aggregate results, and perform
complex operations without having to worry about setting up and managing servers. It comes in the
form of a command-line tool (preinstalled in Cloud Shell) or a web console, both ready for managing
and querying data housed in Google Cloud projects.
GROUP BY
The GROUP BY keyword will aggregate result-set rows that share common criteria (e.g. a column
value) and will return all of the unique entries found for such criteria.
COUNT
The COUNT() function will return the number of rows that share the same criteria (e.g. column
value). This can be very useful in tandem with a GROUP BY.
Add the COUNT function to our previous query to figure out how many rides begin at each starting
point.
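To make the GROUP BY and COUNT behavior concrete, here is a small sketch that again uses Python's sqlite3 module as a local stand-in. The rides table, its start_station_name column, and the station names are hypothetical placeholders for the bikeshare data, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of a bikeshare table: one row per ride.
conn.execute("CREATE TABLE rides (start_station_name TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?)",
    [("Park Lane",), ("Hyde Park",), ("Park Lane",), ("Kings Cross",), ("Park Lane",)],
)

# GROUP BY alone: one result row per distinct starting station.
stations = conn.execute(
    "SELECT start_station_name FROM rides GROUP BY start_station_name"
).fetchall()

# GROUP BY with COUNT(*): number of rides that begin at each station.
counts = dict(
    conn.execute(
        "SELECT start_station_name, COUNT(*) FROM rides GROUP BY start_station_name"
    ).fetchall()
)

print(len(stations))
print(counts)
```

Five ride rows collapse into three groups, and COUNT(*) reports how many rows fell into each group.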
AS
SQL also has an AS keyword, which creates an alias of a table or column. An alias is a new name
given to the returned column or table, whichever AS specifies.
In the results, the right-hand column name changes from COUNT(*) to the alias num_starts. This
is a handy keyword, especially when you are dealing with large sets of data: forgetting what an
ambiguous table or column name refers to happens more often than you think!
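The effect of AS is easiest to see by inspecting the column names of a result set. This sketch uses Python's sqlite3 module as a local stand-in; the rides table and the column_names helper are hypothetical, and the name an engine assigns to an unaliased COUNT(*) column can differ between engines, so only the aliased name should be relied on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-column rides table, for illustration only.
conn.execute("CREATE TABLE rides (start_station_name TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?)",
    [("Park Lane",), ("Hyde Park",), ("Park Lane",)],
)


def column_names(sql):
    """Run a query and return the column names of its result set."""
    cursor = conn.execute(sql)
    return [col[0] for col in cursor.description]


# Without an alias, the engine picks the name of the aggregate column.
without_alias = column_names(
    "SELECT start_station_name, COUNT(*) FROM rides GROUP BY start_station_name"
)

# With AS, the aggregate column is returned under the name you chose.
with_alias = column_names(
    "SELECT start_station_name, COUNT(*) AS num_starts "
    "FROM rides GROUP BY start_station_name"
)

print(without_alias)
print(with_alias)
```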
ORDER BY
The ORDER BY keyword sorts the data returned from a query in ascending or descending order based
on a specified criterion or column value. Add this keyword to the previous query to do the following:
Return a table that contains the number of bikeshare rides that begin at each starting
station, organized alphabetically by the starting station.
Return a table that contains the number of bikeshare rides that begin at each starting
station, organized numerically from lowest to highest.
Return a table that contains the number of bikeshare rides that begin at each starting
station, organized numerically from highest to lowest.
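The three orderings listed above differ only in their ORDER BY clause. The sketch below shows this with Python's sqlite3 module as a local stand-in; the rides table, its column name, and the station names are hypothetical and were chosen so that every station has a different ride count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical bikeshare rides: Park Lane x3, Hyde Park x2, Kings Cross x1.
conn.execute("CREATE TABLE rides (start_station_name TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?)",
    [
        ("Park Lane",), ("Hyde Park",), ("Park Lane",),
        ("Kings Cross",), ("Park Lane",), ("Hyde Park",),
    ],
)

# Shared query: rides per starting station, with the count aliased.
BASE = (
    "SELECT start_station_name, COUNT(*) AS num_starts "
    "FROM rides GROUP BY start_station_name"
)

# Alphabetical by station name (ascending is the default direction).
alphabetical = conn.execute(BASE + " ORDER BY start_station_name").fetchall()

# Numerical, lowest count first.
lowest_first = conn.execute(BASE + " ORDER BY num_starts").fetchall()

# Numerical, highest count first.
highest_first = conn.execute(BASE + " ORDER BY num_starts DESC").fetchall()

print(alphabetical)
print(lowest_first)
print(highest_first)
```

Sorting by the alias num_starts works because ORDER BY is applied after the grouped result has been computed.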
Cloud SQL is a fully-managed database service that makes it easy to set up, maintain, manage, and
administer your relational PostgreSQL and MySQL databases in the cloud. There are two formats of
data accepted by Cloud SQL: dump files (.sql) or CSV files (.csv). You will learn how to export subsets
of the cycle_hire table into CSV files and upload them to Cloud Storage as an intermediate location.
Storing and querying massive datasets can be time consuming and expensive without the right
hardware and infrastructure. BigQuery is an enterprise data warehouse that solves this problem by
enabling super-fast SQL queries using the processing power of Google's infrastructure. Simply move
your data into BigQuery and let us handle the hard work. You can control access to both the project
and your data based on your business needs, such as giving others the ability to view or query your
data.
You can access BigQuery in the Console, with the command-line tool, or by making calls to the BigQuery
REST API using client libraries for languages such as Java, .NET, or Python. There are also a variety
of third-party tools that you can use to interact with BigQuery, for example to visualize or load
the data.
The BigQuery console provides an interface to query tables, including public datasets offered by
BigQuery. The query you will run accesses a table from a public dataset that BigQuery provides. It
uses standard SQL to search the dataset and limits the results returned to 10.
1. Click Compose a New Query. Copy and paste the following query into the BigQuery Query
editor:
#standardSQL
SELECT
  weight_pounds, state, year, gestation_weeks
FROM
  `bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC
LIMIT 10;
A green or red check displays depending on whether the query is valid or invalid. If the query is valid,
the validator also describes the amount of data to be processed after you run the query.
Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB
home directory and runs on Google Cloud. Cloud Shell provides command-line access to your
Google Cloud resources.
1. Click Activate Cloud Shell at the top of the Google Cloud console.
When you are connected, you are already authenticated, and the project is set to
your PROJECT_ID. The output contains a line that declares the PROJECT_ID for this session:
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports
tab-completion.
2. (Optional) You can list the active account name with this command:
gcloud auth list
3. Click Authorize.
Output:
ACTIVE: *
ACCOUNT: "ACCOUNT"
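To set the active account, run:
    $ gcloud config set account `ACCOUNT`
4. (Optional) You can list the project ID with this command:
gcloud config list project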
Output:
[core]
project = "PROJECT_ID"
Examine a table
BigQuery offers a number of sample tables that you can run queries against. In this lab, you'll run
queries against the shakespeare table, which contains an entry for every word in every play.
To examine the schema of the Shakespeare table in the samples dataset, run:
bq show bigquery-public-data:samples.shakespeare
In this command, bq invokes the BigQuery command-line tool, show is the action, and
bigquery-public-data:samples.shakespeare names the project:dataset.table that you want to see.
Output (column headers):
Last modified  Schema  Total Rows  Total Bytes  Expiration  Time Partitioning  Clustered Fields  Labels
Run the help command
When you include a command name with the help command, you get information about that
specific command.
1. For example, the following call to bq help retrieves information about the query command:
bq help query
Now you'll run a query to see how many times the substring "raisin" appears in Shakespeare's works.
When you pass a SQL statement to bq query, use a different quotation mark type inside the statement
than the marks that surround it (" versus ').
2. Run the following standard SQL query in Cloud Shell to count the number of times that the
substring "raisin" appears in all of Shakespeare's works:
bq query --use_legacy_sql=false \
'SELECT
   word,
   SUM(word_count) AS count
 FROM
   `bigquery-public-data`.samples.shakespeare
 WHERE
   word LIKE "%raisin%"
 GROUP BY
   word'
In this command, --use_legacy_sql=false makes standard SQL the default query syntax.
Output:
+---------------+-------+
| word          | count |
+---------------+-------+
| praising      |     8 |
| Praising      |     4 |
| raising       |     5 |
| dispraising   |     2 |
| dispraisingly |     1 |
| raisins       |     1 |
+---------------+-------+
The table demonstrates that although the actual word raisin doesn't appear, the letters appear in
order in several of Shakespeare's works.
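The substring search can be reproduced locally as a sanity check on how LIKE, SUM, and GROUP BY combine. This is a sketch using Python's sqlite3 module, not BigQuery: the totals per word match the output above, but the split of each total across rows, the corpus labels, and the two non-matching words are hypothetical. One difference to keep in mind is that SQLite's LIKE ignores ASCII case by default while BigQuery's is case-sensitive; every word in this sample contains lowercase "raisin", so both behave the same here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shakespeare (word TEXT, word_count INTEGER, corpus TEXT)")
conn.executemany(
    "INSERT INTO shakespeare VALUES (?, ?, ?)",
    [
        # Per-word totals follow the lab output; the per-row split is hypothetical.
        ("praising", 5, "work_a"), ("praising", 3, "work_b"),
        ("Praising", 4, "work_a"),
        ("raising", 2, "work_a"), ("raising", 3, "work_b"),
        ("dispraising", 2, "work_b"),
        ("dispraisingly", 1, "work_a"),
        ("raisins", 1, "work_b"),
        # Hypothetical words that do not contain the substring.
        ("raise", 7, "work_a"), ("rain", 2, "work_b"),
    ],
)

# % matches any run of characters, so %raisin% means "contains raisin".
result = dict(
    conn.execute(
        "SELECT word, SUM(word_count) AS count "
        "FROM shakespeare WHERE word LIKE '%raisin%' GROUP BY word"
    ).fetchall()
)

print(result)
```

Because the grouping is by the exact word, "praising" and "Praising" are reported as separate rows.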
Now create your own table. Every table is stored inside a dataset. A dataset is a group of resources,
such as tables and views.
1. Run bq ls to list any existing datasets in your project:
bq ls
You will be brought back to the command line since there aren't any datasets in your project yet.
2. Run bq ls with the bigquery-public-data project ID followed by a colon (:) to list the datasets
in that specific project:
bq ls bigquery-public-data:
Output:
datasetId
-----------------------------
austin_311
austin_bikeshare
austin_crime
austin_incidents
austin_waste
baseball
bitcoin_blockchain
bls
census_bureau_construction
census_bureau_international
census_bureau_usa
census_utility
chicago_crime
...
Now create a dataset. A dataset name can be up to 1,024 characters long, and consist of A-Z, a-z, 0-9,
and the underscore, but it cannot start with a number or underscore, or have spaces.
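The naming rule can be checked before calling bq mk. The sketch below encodes the rule exactly as it is worded in the paragraph above, using a regular expression; the function name is hypothetical, and the authoritative check is still the one BigQuery performs when the dataset is created.

```python
import re

# First character must be a letter; the rest may be letters, digits, or
# underscores, for a total length of at most 1,024 characters.
DATASET_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,1023}")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if name satisfies the rule as stated in the text."""
    return DATASET_NAME.fullmatch(name) is not None


for candidate in ["babynames", "2010names", "_names", "baby names"]:
    print(candidate, is_valid_dataset_name(candidate))
```

Using fullmatch rather than match ensures that the whole string, not just a prefix, obeys the rule.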
3. Use the bq mk command to create a new dataset named babynames in your project:
bq mk babynames
Sample output:
Run queries
Now you're ready to query the data and return some interesting results.
1. Run the following command to return the top 5 most popular girls' names:
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"
Output:
+----------+-------+
| name | count |
+----------+-------+
| Isabella | 22913 |
| Sophia | 20643 |
| Emma | 17345 |
| Olivia | 17028 |
| Ava | 15433 |
+----------+-------+
2. Run the following command to see the top 5 most unusual boys' names:
bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' ORDER BY count ASC LIMIT 5"
Note: The minimum count is 5 because the source data omits names with fewer than 5 occurrences.
Output:
+----------+-------+
| name     | count |
+----------+-------+
| Aaqib    |     5 |
| Aaidan   |     5 |
| Aadhavan |     5 |
| Aarian   |     5 |
| Aamarion |     5 |
+----------+-------+
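Both queries combine WHERE, ORDER BY, and LIMIT, and differ only in the gender filter and the sort direction. The following sketch replays that pattern with Python's sqlite3 module as a local stand-in. The table holds only the ten rows printed in the two outputs above, not the full names2010 data, and the ranked_names helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names2010 (name TEXT, gender TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO names2010 VALUES (?, ?, ?)",
    [
        # Rows copied from the two outputs shown in the lab.
        ("Isabella", "F", 22913), ("Sophia", "F", 20643), ("Emma", "F", 17345),
        ("Olivia", "F", 17028), ("Ava", "F", 15433),
        ("Aaqib", "M", 5), ("Aaidan", "M", 5), ("Aadhavan", "M", 5),
        ("Aarian", "M", 5), ("Aamarion", "M", 5),
    ],
)


def ranked_names(gender, descending, limit):
    """Return (name, count) rows for one gender, sorted by count."""
    direction = "DESC" if descending else "ASC"  # fixed keywords, never user input
    sql = (
        "SELECT name, count FROM names2010 "
        f"WHERE gender = ? ORDER BY count {direction} LIMIT ?"
    )
    return conn.execute(sql, (gender, limit)).fetchall()


top_girls = ranked_names("F", descending=True, limit=3)
rare_boys = ranked_names("M", descending=False, limit=5)

print(top_girls)
print(rare_boys)
```

When several rows share the same count, as the boys' rows do here, ORDER BY does not define their relative order, so which tied names appear first can vary between runs and engines.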