
SQL (Structured Query Language) is a standard language for data operations that allows you to ask questions and get insights from structured datasets. It's commonly used in database management, and it lets you perform tasks like writing transaction records into relational databases and running petabyte-scale data analysis.

Structured datasets have clear rules and formatting and are often organized into tables, or data that's formatted in rows and columns.

An example of unstructured data would be an image file. Unstructured data can't be queried with SQL and can't be stored natively in BigQuery datasets or tables. To work with image data, for instance, you would use a service like Cloud Vision, perhaps through its API directly.

The following is an example of a structured dataset—a simple table:

User    Price   Shipped
Sean    $35     Yes
Rocky   $50     No

If you've had experience with Google Sheets, then the above should look quite familiar. The table has columns for User, Price, and Shipped, and two rows composed of filled-in column values.
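
If you wanted to build a table like this yourself in BigQuery, you could define and populate it with standard SQL along the following lines. This is a minimal sketch: my_dataset is a placeholder dataset name, and the price is stored as a plain number rather than a formatted "$" string.

CREATE TABLE my_dataset.example_table (
  USER STRING,
  PRICE NUMERIC,   -- price in dollars
  SHIPPED STRING   -- 'Yes' or 'No'
);

INSERT INTO my_dataset.example_table (USER, PRICE, SHIPPED)
VALUES ('Sean', 35, 'Yes'),
       ('Rocky', 50, 'No');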

A database is essentially a collection of one or more tables. SQL is a structured database management tool, but quite often (and in this lab) you will be running queries on one or a few tables joined together, not on whole databases.

SELECT and FROM

SQL reads much like plain English, and before running a query it's always helpful to first figure out what question you want to ask your data (unless you're just exploring for fun).

SQL has predefined keywords, which you use to translate your question into pseudo-English SQL syntax so you can get the database engine to return the answer you want.

The most essential keywords are SELECT and FROM:

 Use SELECT to specify what fields you want to pull from your dataset.

 Use FROM to specify what table or tables you want to pull your data from.

An example may help. Assume that you have the following table, example_table, which has columns USER, PRICE, and SHIPPED (the same layout as the table above). Say that you want to pull just the data that's found in the USER column. You can do this by running the following query that uses SELECT and FROM:

SELECT USER FROM example_table


If you executed the above command, you would select all the names from the USER column that are
found in example_table.

You can also select multiple columns with the SQL SELECT keyword. Say that you want to pull the
data that's found in the USER and SHIPPED columns. To do this, modify the previous query by adding
another column value to our SELECT query (making sure it's separated by a comma!):

SELECT USER, SHIPPED FROM example_table


Running the above retrieves the USER and SHIPPED data from example_table.
WHERE

The WHERE keyword is another SQL command that filters tables for specific column values. Say that
you want to pull the names from example_table whose packages were shipped. You can supplement
the query with a WHERE, like the following:

SELECT USER FROM example_table WHERE SHIPPED='YES'


Running the above returns all USER values from example_table whose packages have been SHIPPED.
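
These keywords also compose. For example, to pull both the USER and PRICE columns, but only for the orders that shipped, you could combine a multi-column SELECT with the WHERE filter (a sketch against the same hypothetical example_table):

SELECT USER, PRICE FROM example_table WHERE SHIPPED='YES'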

Now that you have a baseline understanding of SQL's core keywords, apply what you've learned by
running these types of queries in the BigQuery console.

The BigQuery paradigm

BigQuery is a fully managed, petabyte-scale data warehouse that runs on Google Cloud. Data analysts and data scientists can quickly query and filter large datasets, aggregate results, and perform complex operations without having to worry about setting up and managing servers. It comes in the form of a command-line tool (pre-installed in Cloud Shell) and a web console, both ready for managing and querying data housed in Google Cloud projects.

GROUP BY

The GROUP BY keyword will aggregate result-set rows that share common criteria (e.g. a column
value) and will return all of the unique entries found for such criteria.

This is a useful keyword for figuring out categorical information on tables.

SELECT start_station_name FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name;

Your results are a list of unique (non-duplicate) column values.


Without the GROUP BY, the query would have returned the full 83,434,866 rows. GROUP BY will
output the unique column values found in the table. You can see this for yourself by looking in the
bottom right corner. You will see 954 rows, meaning there are 954 distinct London bikeshare starting
points.
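
To tie this back to the small example_table from earlier, the same idea at miniature scale would return one row per distinct SHIPPED value, i.e. just "Yes" and "No" (a sketch against that hypothetical table):

SELECT SHIPPED FROM example_table GROUP BY SHIPPED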

COUNT

The COUNT() function will return the number of rows that share the same criteria (e.g. column
value). This can be very useful in tandem with a GROUP BY.

Add the COUNT function to our previous query to figure out how many rides begin at each starting
point.

SELECT start_station_name, COUNT(*) FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name;

AS

SQL also has an AS keyword, which creates an alias of a table or column. An alias is a new name
that's given to the returned column or table—whatever AS specifies.

SELECT start_station_name, COUNT(*) AS num_starts FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name;

In the results, the rightmost column name has changed from COUNT(*) to num_starts.

As you can see, the COUNT(*) column in the returned table is now set to the alias name num_starts. This is a handy keyword, especially when you are dealing with large sets of data; forgetting what an ambiguous table or column name refers to happens more often than you think!

ORDER BY

The ORDER BY keyword sorts the returned data from a query in ascending or descending order based on a specified criterion or column value. Add this keyword to the previous query to do the following:

 Return a table that contains the number of bikeshare rides that begin at each starting
station, organized alphabetically by the starting station.

 Return a table that contains the number of bikeshare rides that begin at each starting
station, organized numerically from lowest to highest.

 Return a table that contains the number of bikeshare rides that begin at each starting
station, organized numerically from highest to lowest.

SELECT start_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name ORDER BY start_station_name;

SELECT start_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name ORDER BY num;

SELECT start_station_name, COUNT(*) AS num FROM `bigquery-public-data.london_bicycles.cycle_hire` GROUP BY start_station_name ORDER BY num DESC;
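
These keywords combine naturally with the earlier ones. For instance, a sketch that ranks starting stations by their number of longer rides might look like the following; it assumes the cycle_hire table's duration column, which records ride length in seconds.

SELECT start_station_name, COUNT(*) AS num
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE duration > 1800   -- only rides longer than 30 minutes
GROUP BY start_station_name
ORDER BY num DESC;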

Exporting queries as CSV files

Cloud SQL is a fully-managed database service that makes it easy to set up, maintain, manage, and administer your relational PostgreSQL and MySQL databases in the cloud. There are two formats of data accepted by Cloud SQL: dump files (.sql) or CSV files (.csv). You will learn how to export subsets of the cycle_hire table into CSV files and upload them to Cloud Storage as an intermediate location.

Storing and querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. BigQuery is an enterprise data warehouse that solves this problem by enabling super-fast SQL queries using the processing power of Google's infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data.
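
As an aside, if you prefer to stay entirely in SQL, BigQuery's EXPORT DATA statement can write query results to Cloud Storage as CSV directly. The sketch below assumes a bucket named gs://my-bucket that you would create first; it is not the exact console-based export flow this lab walks through.

EXPORT DATA OPTIONS(
  uri='gs://my-bucket/cycle_hire_subset_*.csv',  -- wildcard required so results can be sharded
  format='CSV',
  overwrite=true,
  header=true
) AS
SELECT start_station_name, COUNT(*) AS num
FROM `bigquery-public-data.london_bicycles.cycle_hire`
GROUP BY start_station_name;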

You can access BigQuery in the Console, the command-line tool, or by making calls to the BigQuery
REST API using a variety of client libraries such as Java, .NET, or Python. There are also a variety
of third-party tools that you can use to interact with BigQuery, such as visualizing the data or loading
the data.

The BigQuery console provides an interface to query tables, including public datasets offered by BigQuery. The query you will run accesses a table from a public dataset that BigQuery provides. It uses standard SQL to query the dataset and limits the results returned to 10.

Query a public dataset

1. Click Compose a New Query. Copy and paste the following query into the BigQuery Query
editor:

#standardSQL
SELECT
  weight_pounds, state, year, gestation_weeks
FROM
  `bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC LIMIT 10;


This data sample holds information about US natality (birth rates).

A green or red check displays depending on whether the query is valid or invalid. If the query is valid,
the validator also describes the amount of data to be processed after you run the query.

This information helps determine the cost to run a query.

2. Click the Run button.

Your query results should resemble the following:


Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB
home directory and runs on the Google Cloud. Cloud Shell provides command-line access to your
Google Cloud resources.

1. Click Activate Cloud Shell at the top of the Google Cloud console.

When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. The output contains a line that declares the PROJECT_ID for this session:

Your Cloud Platform project in this session is set to "PROJECT_ID"

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports
tab-completion.

2. (Optional) You can list the active account name with this command:

gcloud auth list


3. Click Authorize.

Output:

ACTIVE: *

ACCOUNT: "ACCOUNT"
To set the active account, run:

$ gcloud config set account `ACCOUNT`

4. (Optional) You can list the project ID with this command:

gcloud config list project


Output:

[core]

project = "PROJECT_ID"

Examine a table

BigQuery offers a number of sample tables that you can run queries against. In this lab, you'll run
queries against the shakespeare table, which contains an entry for every word in every play.

To examine the schema of the Shakespeare table in the samples dataset, run:

bq show bigquery-public-data:samples.shakespeare


In this command you're doing the following:

 bq to invoke the BigQuery command line tool

 show is the action

 Then you're identifying which project, dataset, and table to show, in the form project:dataset.table.
Run the help command

When you include a command name after bq help, you get information about that specific command.

1. For example, the following call to bq help retrieves information about the query command:

bq help query


2. To see a list of all of the commands bq uses, run just bq help.

For reference, the earlier bq show command returns the shakespeare table's schema and metadata:

 Last modified                 Schema                  Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels
----------------- ------------------------------------ ------------ ------------- ------------ ------------------- ------------------ --------
 14 Mar 13:16:45   |- word: string (required)            164656       6432064
                   |- word_count: integer (required)
                   |- corpus: string (required)
                   |- corpus_date: integer (required)


Run a query

Now you'll run a query to see how many times the substring "raisin" appears in Shakespeare's works.

1. To run a query, run the command bq query "[SQL_STATEMENT]":

 Escape any quotation marks inside the [SQL_STATEMENT] with a \ mark, or

 Use a different quotation mark type than the surrounding marks (" versus ').

2. Run the following standard SQL query in Cloud Shell to count the number of times that the
substring "raisin" appears in all of Shakespeare's works:

bq query --use_legacy_sql=false \
'SELECT
  word,
  SUM(word_count) AS count
FROM
  `bigquery-public-data`.samples.shakespeare
WHERE
  word LIKE "%raisin%"
GROUP BY
  word'

In this command:

 --use_legacy_sql=false makes standard SQL the default query syntax.

Output:

Waiting on job_e19 ... (0s) Current status: DONE

+---------------+-------+
|     word      | count |
+---------------+-------+
| praising      |     8 |
| Praising      |     4 |
| raising       |     5 |
| dispraising   |     2 |
| dispraisingly |     1 |
| raisins       |     1 |
+---------------+-------+

The table shows that although the exact word raisin doesn't appear on its own, the substring does appear, in order, inside several words across Shakespeare's works.
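
That behavior comes from the % signs in the LIKE pattern, which are wildcards matching any run of characters on either side of the substring. Anchoring the pattern narrows the match; for example, this sketch keeps only words that start with "raisin" (such as "raisins"):

SELECT word, SUM(word_count) AS count
FROM `bigquery-public-data`.samples.shakespeare
WHERE word LIKE "raisin%"   -- no leading %, so the word must begin with "raisin"
GROUP BY word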

Create a new table

Now create your own table. Every table is stored inside a dataset. A dataset is a group of resources,
such as tables and views.

Create a new dataset

1. Use the bq ls command to list any existing datasets in your project:

bq ls


You will be brought back to the command line since there aren't any datasets in your project yet.

2. Run bq ls with the bigquery-public-data project ID, followed by a colon (:), to list the datasets in that specific project:

bq ls bigquery-public-data:

Output:

datasetId

-----------------------------

austin_311

austin_bikeshare

austin_crime

austin_incidents

austin_waste

baseball

bitcoin_blockchain

bls

census_bureau_construction
census_bureau_international

census_bureau_usa

census_utility

chicago_crime

...

Now create a dataset. A dataset name can be up to 1,024 characters long, and consist of A-Z, a-z, 0-9,
and the underscore, but it cannot start with a number or underscore, or have spaces.

3. Use the bq mk command to create a new dataset named babynames in your project:

bq mk babynames


Sample output:

Dataset 'qwiklabs-gcp-ba3466847fe3cec0:babynames' successfully created.

Run queries

Now you're ready to query the data and return some interesting results.

1. Run the following command to return the top 5 most popular girls' names:

bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'F' ORDER BY count
DESC LIMIT 5"


Output:

Waiting on job_58c0f5ca52764ef1902eba611b71c651 ... (0s) Current status: DONE

+----------+-------+
|   name   | count |
+----------+-------+
| Isabella | 22913 |
| Sophia   | 20643 |
| Emma     | 17345 |
| Olivia   | 17028 |
| Ava      | 15433 |
+----------+-------+

2. Run the following command to see the top 5 most unusual boys' names:

bq query "SELECT name,count FROM babynames.names2010 WHERE gender = 'M' ORDER BY count
ASC LIMIT 5"


Note: The minimum count is 5 because the source data omits names with fewer than 5 occurrences.

Output:

Waiting on job_556ba2e5aad340a7b2818c3e3280b7a3 ... (1s) Current status: DONE

+----------+-------+
|   name   | count |
+----------+-------+
| Aaqib    |     5 |
| Aaidan   |     5 |
| Aadhavan |     5 |
| Aarian   |     5 |
| Aamarion |     5 |
+----------+-------+
