Manage Data Access With Unity Catalog

This document provides an overview of Unity Catalog and how it addresses challenges with data governance. Unity Catalog provides a unified approach to governing data and AI assets across clouds and platforms through centralized access controls, auditing, and management of permissions. It integrates with the Databricks Lakehouse Platform to simplify data governance and improve security, access management, and visibility of data usage.

In this series of videos, we'll teach you key concepts about Unity Catalog,

including how it integrates with the Databricks platform.


How to access Unity Catalog through clusters and SQL warehouses.
How to create and govern data assets in Unity Catalog.
And finally, we'll review Databricks recommendations for your organization's Unity
Catalog based solutions.
Let's get started.
Before we dive into the specifics of Unity Catalog, let's take a moment to discuss the more general topic of data governance.
80% of organizations seeking to scale digital businesses will fail because they do
not take a modern approach to data and analytics governance.
Data governance refers to the process of managing the availability, usability,
integrity, and security of data.
Data governance has become extremely important in today's data-driven landscape,
and it's needed to establish trustworthy data that is readily accessible to those
who need it, in accordance with internally defined policies or external regulatory
requirements.
When we talk about data governance, we can generally divide the discussion into
four main areas.
First is data access control.
Organizations need to be able to lock down data and data generating artifacts such
as files, tables, and machine learning models.
Permission to access should be granted only to those who need it.
Second is data access auditing.
Any data governance program requires knowledge of how data is being used.
That doesn't just mean who is accessing the data, but when and how it's being used.
Third is data lineage.
The journey your data takes from its origin as it moves through your pipelines is
referred to as data lineage, and being able to trace it is very beneficial in a
number of ways.
First, it's much easier to identify root causes of processing issues.
Second, it enables you to predict the impact of proposed upstream changes to your processing.
Third, it helps compliance teams prove that data and reports come from trusted and
verifiable sources.
And finally, it fuels better understanding of the data for data analysts,
scientists, and engineers.
And the fourth area is data discovery.
Discovering data is always a challenge, particularly in a data lake.
But a data governance program requires a readily accessible inventory of data
assets that can be easily searched.
Since the emergence of the data lake paradigm, data governance in this context has
been complex, perhaps unnecessarily so.
Many organizations employ a data lake alongside a data warehouse, which leads to duplicated data and hence data drift.
Governance happens at two different levels across these platforms, both of which
often employ fundamentally different tool stacks.
This makes it hard to collaborate and control things consistently.
Data and AI governance today is complex.
Organizations are trying to serve all of these different types of technology, and this creates a really complex environment for governance.
In a typical enterprise today, you have a lot of data stored in data lakes, for
example, AWS S3.
To control permissions on data in the data lake, you set permissions on files and
directories.
It means you cannot set fine-grained permissions on rows and columns.
Because governance controls are at the file level, data teams must carefully
structure their data layout to support the desired policies.
For example, a team might partition data into different directories by country and
give access to each directory to different groups.
But what should the team do when governance rules change?
If different states inside one country adopt different data regulations, the
organization may need to restructure all its directories and files.
So it becomes complex to rewrite all the policy changes.
With most data lakes, you not only have files, but you also have metadata.
For example, you might have a Hive metastore that keeps track of table definitions and views.
So you have to set permissions on tables and views as well.
This can actually go out of sync with underlying data.
There's no guarantee that if users have access permissions on files, then they will
have permission on the corresponding table, or vice versa.
It becomes very confusing to manage these permissions.
Then you might have your data warehouse, where the permissions are more fine-
grained on tables, columns, and views.
But again, it's a different governance model.
In a typical scenario, you have data movement from your data lakes and data
warehouses.
And now you've created data silos with data movement across two systems, each with
a different governance model.
As a result, your governance method is inconsistent and prone to errors, making it
difficult to manage permissions, conduct audits, or discover and share data.
But data isn't limited to files or tables.
You also have assets like dashboards, machine learning models, and notebooks, each
with its own permission models and tech stack, making it difficult to manage access
permissions for all these assets consistently.
The problem gets bigger when your data assets exist across multiple clouds with
different access management solutions.
What you need is a unified approach to simplifying data governance for data and AI.
To address these challenges, we created Unity Catalog, which provides a unified governance layer for all data and AI assets in your lakehouse.
Unity Catalog provides a single interface to manage permissions or auditing for
your data and AI assets.
So how does Unity Catalog rise to these challenges?
By integrating a centralized hub within the Databricks Lakehouse Platform for administering and securing access to your data and auditing its use.
You can unify governance across clouds. Unity Catalog blends in with the Databricks Lakehouse Platform and allows you to define your data access rules once, where
they can be applied across multiple workspaces, clouds, languages, and use cases,
governing all your data assets in a consistent way through easy to use user
interfaces and SQL APIs.
The access controls available go far beyond what any cloud system can provide.
Unity Catalog provides access control for all managed data assets, including files,
tables, rows, columns within tables, and views.
While Databricks has provided some degree of access control in the past, its
security model is permissive by default and required careful administration of the
access control lists and the compute resources accessing the data to yield a fully
secure solution.
The access control rules are defined as a property of a workspace, so reliably
scaling this out to a multi workspace or multi cloud environment is a real
challenge.
Unity Catalog fixes all that. It lives outside the workspace and therefore spans
workspaces and clouds.
It's secured by default and doesn't rely on the cooperation of compute resources to
enforce access control.
You can also unify data and AI assets. While your data is free to roam, Unity
Catalog can manage it all from a central hub in a consistent and familiar way.
There's no need to replicate or translate your security requirements across
different systems. If rules change, those changes only need to be applied once.
Unity Catalog also brings with it the ability to do fine-grained auditing of all
queries performed to support data governance use cases.
Data lineage will also be collected and visualized for tables and columns across
all languages.
And finally, you can unify existing catalogs. Unity Catalog is additive and works seamlessly with other metastores, including the legacy Hive metastore that comes with each Databricks workspace.
It's a simple operation that can be done whenever you like.
Now that we understand some of the benefits Unity Catalog brings, let's look at
some of the key elements of Unity Catalog that are important to understanding how
Unity Catalog works.
First is the metastore.
The meta store is the top-level logical container in Unity Catalog. It's a
construct that represents the metadata, that is, information about the objects
being managed by the meta store, as well as the access control lists that govern
access to those objects.
Don't confuse the Unity Catalog meta store with the Hive meta store. The Hive meta
store, which will be familiar to existing Databricks users, is the traditional
default meta store linked to each Databricks workspace.
And while it may seem functionally similar to a Unity Catalog meta store, Unity
Catalog meta stores offer improved security and auditing capabilities, as well as
an improved security model and a host of other features.
It's best to think of a meta store as a logical construct for organizing your data
and its associated metadata, rather than a physical container itself.
This theme is reflected in the physical makeup of a meta store, and accepting this
idea will also make some of our best practices more sensible, which we'll cover
later in this lesson.
The meta store essentially functions as a reference or a collection of meta data
and a link to a cloud storage container.
The metadata, that is, information about the data objects, like a table's columns and data types, for example, as well as the access control lists for those objects, is
stored in the control plane.
This ties a meta store to a cloud region. This is why account administrators are
prompted for a region when creating a meta store.
Data related objects managed by the meta store are stored in a cloud storage
container, which is also configured as part of the meta store setup.
We won't dive much further into these details at the present time, though we'll
revisit these concepts throughout the lesson where they apply.
An important property of Unity Catalog is that it is additive and does not preclude
access to your existing data objects stored in your local meta store.
Once a workspace is connected to Unity Catalog through assignment of a Unity Catalog metastore, Unity Catalog presents the Hive metastore as a special catalog named hive_metastore.
Assets within the Hive metastore can be seamlessly referenced by specifying hive_metastore as the first part of the three-level namespace, or, if it is set as the default catalog, two-level references work as they always have.
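For example, a table registered in the legacy Hive metastore could be referenced either way; the schema and table names here are just placeholders, not objects from this course:

    -- Three-level reference through the special hive_metastore catalog
    SELECT * FROM hive_metastore.default.my_legacy_table;

    -- Two-level reference, if hive_metastore is the default catalog for the session
    USE CATALOG hive_metastore;
    SELECT * FROM default.my_legacy_table;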
Of course, Unity Catalog will not enforce any access control on assets stored
there, though legacy table access control lists are supported in some compute
configurations, and we'll talk more about this later.
With a high level understanding of what a meta store is, let's now drill into the
data object hierarchy, starting with the catalog.
A catalog is the top most container for data objects in Unity Catalog and forms the
first part of the three level name space that we'll see again and again throughout
this module.
Before we go any further, let's take a quick moment to talk more about the three-level namespace. Existing Databricks users, and for that matter anyone familiar with SQL, understand the traditional two-level namespace used to address tables within schemas.
Unity Catalog introduces this third level to provide improved data segregation capabilities; correspondingly, complete SQL references in Unity Catalog use three levels.
Getting back to the catalog, meta stores can have as many catalogs as desired.
Catalogs are containers, but they can only contain schemas, which we'll talk about
next.
The concept of a schema, which is sometimes referred to as a database, is part of
traditional SQL and is unchanged by Unity Catalog.
It functions as a container for data bearing assets like tables and views, and
forms the second part of the three level name space we just saw.
Catalogs can contain as many schemas as desired, which in turn can contain as many
data objects as desired.
At the bottom layer of the hierarchy, let's first talk about tables.
Existing Databricks users and SQL developers understand tables well. These are SQL
relations consisting of an ordered list of columns, and the overall concept of them
is unchanged by Unity Catalog.
Although tables do have variations in Databricks, and to fully understand these,
it's important to recognize that tables are fully defined by two distinct elements.
First, the metadata, or the information about the table, including the list of
columns and their associated data types.
And then we have the data that populates the rows of the table, originating from
data files stored in Cloud Object Storage.
With these elements in mind, let's now talk about the two table variations that
existing Databricks users will likely be familiar with, managed and external
tables.
In both cases, the table metadata is managed by the Metastore in the control plane.
Dropping the table always means discarding the metadata relating to the table.
The difference between the variations amounts to where the table data resides. In
the case of a managed table, data files are stored in the managed storage location,
that is, the Cloud Storage container backing the Metastore that we mentioned
previously.
With an external table, data files are stored in some other Cloud Storage location
supplied by the user. The ability to completely decouple the data like this is
important in some use cases and for many of our users.
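As a quick sketch of the difference (the names and the storage path here are hypothetical, not from this course):

    -- Managed table: data files land in the metastore's managed storage location
    CREATE TABLE main.example.managed_sales (id INT, amount DOUBLE);

    -- External table: data files live in a user-supplied cloud storage path
    CREATE TABLE main.example.external_sales (id INT, amount DOUBLE)
    LOCATION 's3://my-bucket/sales/';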
Otherwise, operating on managed versus external tables is largely similar, with the
notable exception of dropping the table.
In the case of an external table, the underlying table data is left intact. Only
the table metadata is discarded.
If you were to recreate a table using the same location, the table data will appear in the state it was in when the table was dropped.
Dropping a managed table, on the other hand, discards both the metadata and the data.
Next, views. Existing Databricks users and SQL developers will also likely understand views well.
Views are essentially stored queries that are executed when you query the view.
Views perform arbitrary SQL transformations of tables or other views and are read-only; they do not have the ability to modify the underlying data. The final elements in
the data object hierarchy are user-defined functions, or UDFs.
UDFs enable you to encapsulate custom functionality into a function that can be
invoked within queries.
We also have storage credentials and external locations. We'll talk about storage credentials first.
Since all data lives in the cloud, Unity Catalog needs a way to authenticate with cloud storage containers.
This is true whether you're using the default storage location configured as part
of a Metastore or arbitrary user-supplied Cloud Storage.
The storage credential satisfies the requirement of encapsulating an authentication
method to access Cloud Storage.
Each Metastore has access to external Cloud Storage to support external tables or
file access.
While storage credentials fulfill a critically important requirement, they're not
very convenient in terms of access control when dealing with external Cloud
Storage.
Storage credentials apply to the entire cloud storage container they reference; therefore, granting privileges on them applies to the entire container.
It's typically desirable to achieve finer-grained control.
External locations build on the concept of a storage credential and extend it with a storage path within the cloud storage container,
allowing users to arbitrarily subdivide containers into smaller pieces and exercise
control over each subdivision.
External locations can be used to support and manage external tables.
They can also be used to govern direct access to files stored in Cloud Storage.
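As an illustration, an external location couples a storage credential with a path and can then back external tables or direct file access; the credential, location, path, and group names below are made up:

    -- Assumes a storage credential named my_cred already exists
    CREATE EXTERNAL LOCATION IF NOT EXISTS sales_raw
    URL 's3://my-bucket/sales/raw'
    WITH (STORAGE CREDENTIAL my_cred);

    -- Grant access on just this subdivision of the bucket
    GRANT READ FILES ON EXTERNAL LOCATION sales_raw TO `data_engineers`;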
And finally, we'll talk about shares and recipients.
Shares and recipients relate to Delta Sharing, an open protocol developed by
Databricks for secure, low overhead data sharing across organizations.
It's intrinsically built into Unity Catalog and is used to explicitly declare shares, that is, read-only logical collections of tables.
These can be shared with one or more recipients, that is, data readers outside the organization.
We mention these constructs here since they are part of the Unity Catalog security
model.
And though Delta Sharing will be mentioned here and there, we won't put a heavy
focus on it in this module.
Databricks Academy offers a separate training focused on Delta Sharing.
Let's now talk about the Unity Catalog architecture.
So how does Unity Catalog fit into the Databricks landscape?
Well, let's talk about life before Unity Catalog.
On the left, we see an overview of how things looked prior to the introduction of
Unity Catalog.
Prior to Unity Catalog, user management and data management were a function of the
workspace.
Users and groups were defined within a particular workspace, either manually or
automatically, through a connection to an identity provider of some sort.
But the key point here is that users and groups are ingrained into a single
workspace, with no intrinsic synchronization in a multiple workspace environment.
Access control was provided through a cooperative effort between a metastore, that is, the repository that stores information about your data, and the compute resources accessing the data, that is, your clusters and SQL warehouses.
By default, this Metastore was furnished by a Hive Metastore local to each
workspace, although integrating an external Metastore was also supported.
One key element to note here is the locality of the metastore, which makes it challenging to easily and consistently apply the same set of security controls across multiple workspaces.
Furthermore, full security requires cooperation between compute resources and the
Metastore.
Your clusters must be configured to enforce access control rules in order for those rules to take effect.
If compute resources are not appropriately configured, or users are allowed
unrestricted configuration of their own clusters, then access rules can be
bypassed.
By contrast, Unity Catalog, with its enhanced security model, sits on its own and factors the security-related elements out of the workspace, delivering the define-once principle we discussed earlier.
In this architecture, users and groups for Unity Catalog are managed through the
Account Console, manually or through an identity provider, and then assigned to one
or more workspaces.
Metastores are likewise factored out of the workspace and managed through the Account Console, where they can be assigned to workspaces.
Any one Metastore can be assigned to more than one workspace, enabling multiple
workspaces to share the same access control lists.
Compute resources that can connect to Unity Catalog will, by default, be subject to
Unity Catalog's security constraints.
No specific configuration or administration is needed to make the system secure.
Furthermore, any changes to security policies defined in a Metastore are
automatically and immediately propagated to all assigned workspaces and their
associated clusters or SQL warehouses.
Having discussed all the key concepts related to Unity Catalog, let's take a final
look at its security model in action.
Let's begin by touring the lifecycle of a query to see how Unity Catalog provides access control in a secure yet performant way.
The story begins with a principal issuing a query.
Queries can be issued through all purpose clusters, for cases when users are
running Python or SQL workloads interactively.
In the case of a job or pipeline running as a service principal, this would
typically run through a job cluster.
Alternatively, data analysts may issue queries in Databricks SQL through a SQL
warehouse, or the query could be originating from a BI tool connected to a SQL
warehouse.
In any case, the applicable compute resource begins processing the query.
Next, the request is dispatched to Unity Catalog, which in turn logs the request
and validates the query against all security constraints defined within the
Metastore, to which the compute resource is associated.
For each object referenced in the query, Unity Catalog assumes the appropriate
Cloud credential governing that object, as provided by a Cloud Administrator.
For managed tables, this could be the Cloud Storage associated with the Metastore.
For files or external tables, this would be an external location governed by a
storage credential.
Again, for each object referenced in the query, Unity Catalog generates a scoped temporary token to enable the client to access the data directly from storage, and returns that token along with an access URL.
That allows the cluster or SQL warehouse to access data directly but securely.
The cluster or SQL warehouse requests data directly from Cloud Storage using the
URL and token passed back from Unity Catalog.
Data is transferred back from Cloud Storage. This request process is repeated for
each object referenced by the query.
With access to data at the partition level, last mile row or column-based filtering
is applied on the cluster or SQL warehouse.
And finally, filtered results are passed back to the caller.
In this module, we'll discuss considerations related to connecting the two main
Databricks compute resources, clusters and SQL warehouses to Unity Catalog.
Next, let's talk about compute resources in Unity Catalog.
The main distinction Unity Catalog brings to cluster configuration is the
introduction of the access mode parameter.
Let's talk about that now.
All users, regardless of whether or not they have used Databricks before, must be
familiar with the new access mode parameter and its available options.
For existing Databricks users, the introduction of this new parameter alleviates
the previous overloading of the cluster mode parameter, thus eliminating the high
concurrency cluster mode.
While there are three access mode options to choose from, only two are relevant as far as Unity Catalog is concerned.
These are single user and shared.
As the name suggests, clusters using the single user mode can only be used by a
single user, who is designated when creating the cluster.
The designated user can also be edited after the fact.
The main point here is that this setting is independent of any other cluster access
control provided by the workspace, and only the designated user can attach to the
cluster.
The main advantages of single user clusters are their language support and support for all features that could otherwise compromise the environment of a shared cluster.
These features include init scripts, library installation, and DBFS FUSE mounts.
Note, however, that dynamic views, which we'll talk about later, are not currently supported on single user clusters.
Shared clusters can be shared by multiple users, but only Python and SQL workloads are allowed.
Some advanced cluster features, such as library installation, init scripts, and DBFS FUSE mounts, are disabled to ensure security isolation among users of the cluster.
Notebook-level installations will still work, but cluster installations will not.
Do not use the remaining access mode, titled No Isolation Shared, if you are running workloads that need to access Unity Catalog.
We provided this matrix for reference to aid in selecting an appropriate security
mode, with the shaded rows representing modes that support Unity Catalog access.
Your choice will be primarily driven by the type of workloads you want to run.
Single user clusters are best for automated jobs or general-purpose work when you
need features that user isolation doesn't support.
Remember that this mode requires a cluster for each user, since only the designated
user can attach.
For interactive development that relies on Python or SQL, you'll likely want to choose shared, unless you need advanced features not supported in that mode.
It's also important to note that dynamic views, which we'll discuss in detail later, are only supported in this mode.
If you need the row or column protection that dynamic views offer, then you need to choose this option.
Let's now talk about roles and identities in Unity Catalog.
Cloud administrators have the ability to administer and control the cloud resources
that Unity Catalog leverages.
These entities have different names depending on which cloud you're working with, but they include storage accounts or buckets, and IAM roles, service principals, or managed identities.
As it relates to Unity Catalog, cloud administrators are involved in setting up the resources needed to support new metastores.
They're responsible for creating a cloud storage container and setting it up so that Databricks can access the container in a secure manner.
The detailed responsibilities and tasks associated with this role are cloud-specific and outside the scope of this lesson.
Identity administrators have the ability to administer users and groups in the
identity provider when one is in use.
This service provisions identities at the account level through SCIM connectors, so
that identities do not need to be created and managed manually.
This setup is performed and maintained by identity administrators and is dependent
on the identity provider in use.
The exact responsibilities and tasks associated with this role are also outside the scope of this training.
Account administrators have the ability to administer and control anything at the
account level.
As far as Unity Catalog is concerned, this includes the following tasks.
Creating and managing metastores based on resources set up by a cloud administrator, assigning metastores to workspaces, and managing users or groups, or setting up integration with an identity provider for automated identity management, a task done in cooperation with an identity administrator.
Account administrators also have full access to all data objects with the added
ability to grant privileges and change ownerships.
The initial account administrator is designated when setting up Databricks for the
first time in your organization.
Regular users can be elevated to account administrator by an existing account
administrator through the setting of an attribute in the user's profile.
Every metastore has an administrator too, who by default is the account
administrator that created the metastore.
The metastore administrator has the ability to create catalogs and other data
objects within the metastore they own.
They have full access to all data objects within the metastore with the added
ability to grant privileges and change ownerships.
Essentially, they have the same abilities as account administrators, but only
within the metastore they own, where account administrators have those abilities
over all metastores.
The metastore administrator can be changed by the current metastore administrator
or an account administrator.
It's best practice to designate a group as the metastore admin rather than an individual; we'll go into more detail on this shortly.
And finally, each data object also has an owner, who by default is the principal who created the object.
Data owners have the ability to perform grants on data objects they own, as well as
create new nested objects.
For example, a schema owner can create a table within that schema.
Owners also have the ability to grant privileges and change ownerships over the
objects they own.
Data ownership can be changed by the owner, the metastore administrator, or any
account administrator.
Though this final role has little to do with data governance, we'll talk about it
quickly to round out the big picture.
Workspace administrators have the ability to perform administrative tasks on
specific workspaces.
These tasks include administering permissions on assets and compute resources within the workspace,
defining cluster policies that limit users' ability to create their own clusters,
adding or removing user assignments,
elevating user permissions within a workspace, and changing job ownerships.
Workspace administrators can be designated by account administrators when assigning
users to the workspace,
though a workspace administrator can also elevate a user within the workspace they
administer.
A user in Databricks corresponds to an individual physical user of the system, that
is, a person.
Users authenticate using their email and password and interact with the platform
through the user interface.
Users can also access functionality through command-line tools and REST APIs.
User identities in Databricks are fairly straightforward.
Users are uniquely identified by email address and can carry additional
information, at a minimum, a first and last name, to make identities more readable.
As we mentioned previously, account administrators have the ability to perform
several administrative tasks important to Unity catalog,
such as managing and assigning metastores to workspaces and managing other users.
The initial account administrator is designated when setting up Databricks for the first time, but users can be elevated by enabling the admin role in the user's profile.
A service principal is an individual identity for use with automated tools, running
jobs and applications.
While they are assigned a name by the creator, they are uniquely identified by a globally unique identifier, commonly referred to as a GUID, that is assigned by the platform when the service principal is created.
Service principals authenticate with the platform using an access token. They access functionality through the APIs, or can run workloads using Databricks workflows.
Like users, service principals can be elevated to have administrator privileges,
which allows them to programmatically carry out any of the account management tasks
that a user with this role can perform.
Groups are a somewhat universal construct in any governance scheme and are used to
gather individual users into a composite unit to simplify management.
In the context of Databricks, groups collect users and service principals into a
single entity to achieve a simplification.
Any grants given to the group are automatically inherited by all members of the
group.
Like the built-in Databricks roles we talked about earlier, most data governance
programs employ custom roles that define who can access what and how within the
organization.
Groups provide a user management construct that cleanly maps to such roles,
simplifying the implementation of data governance policies.
In this way, permissions can be granted to groups in accordance with your
organization's security policies, and users can be added to groups in accordance
with their roles within the organization.
It's inevitable that users will transition between roles. When that happens, it's
trivial to move or copy users from one group to another.
Likewise, as your governance model evolves and role definitions change, it's easy to effect those changes on groups.
Making changes like this is significantly more labor-intensive and error-prone if the system is built with permissions hardwired at the individual user level.
Groups can also be nested within other groups if your security model calls for
that.
In this instance, the outer group, all users, is referred to as a parent group of
the inner two groups, analysts and developers.
In this case, the inner groups automatically inherit all grants given to the parent
group.
Recall that identities for Unity catalog are managed through the account console.
However, identities still exist distinctly in the workspaces to which they're
assigned.
Though account and workspace identities are distinct, they are linked by the
identifiable information common to the two.
For users, this is the email address, and for service principals, this is the globally unique identifier.
Thus, it's important when propagating an account-level user to a workspace that
their email address, as recorded in their workspace identity, matches exactly.
Otherwise, users running workloads would run into trouble as soon as they try to
access any data objects through Unity catalog,
even though they might be able to log in to their assigned workspace.
For this reason, workspaces support a feature called Identity Federation, which simplifies the maintenance of account identities across one or more workspaces.
Identity Federation alleviates the need to manually create and maintain copies of
identities at the workspace level.
Identities, that is, users, service principals, groups, and nested groups, are created once in the account console.
Then, they can be assigned to one or more workspaces as needed.
Assignment is a simple operation that can be done from the account console by an account administrator, or from the workspace by a workspace administrator.
Though Workspace identities can still be managed directly from within those
individual workspaces by a Workspace administrator,
Databricks discourages this practice and rather encourages the management of
identities at the account level.
We talked about the Unity catalog security model at a high level.
Let's take a deeper dive into how that security model applies to data objects like
tables and views.
Catalogs and schemas, both containers, support two privileges: create and usage.
These form the foundation of Unity Catalog's explicit permission model, which means that all permissions must be explicitly granted and are not implied or inherited.
Create allows a grantee to create child objects.
In the case of a catalog, this means the ability to create schemas within the
catalog.
For schemas, this means the ability to create data objects like tables, views, and
functions.
As for creating catalogs, only the metastore admin can do that.
Usage allows the grantee to traverse the container in order to access child
objects.
To access a table, for example, you need usage on the containing schema and
catalog, as well as appropriate privileges on the table itself.
Without this entire chain of grants in place, access will not be permitted.
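For example, letting a group query a single table takes the full chain of grants shown below; the catalog, schema, table, and group names are placeholders, and newer Databricks releases spell the usage privilege as USE CATALOG and USE SCHEMA:

    GRANT USAGE ON CATALOG sales_catalog TO `analysts`;
    GRANT USAGE ON SCHEMA sales_catalog.reporting TO `analysts`;
    GRANT SELECT ON TABLE sales_catalog.reporting.orders TO `analysts`;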
To reiterate this important point: privileges are not automatically inherited by child objects.
Granting create privileges on a catalog, for example, does not automatically propagate that privilege to the schemas within that catalog.
Tables, both external and managed, support select and modify.
Select allows querying of the table, while modify allows for modification of the
table data through an update or metadata through Alter.
As a reminder, external and managed tables are treated the same from an access
control perspective.
Views, that is, read-only queries against one or more tables that run when you query the view, support select.
One subtle but important property regarding views is the fact that when queried,
the query is treated as though it were running as the owner of the view.
This means that users do not need access to the underlying source tables to access
a view.
This provides you with the ability to protect tables using views.
User-defined functions allow you to augment the comprehensive suite of available SQL functions with custom code of your own, and are managed like other data objects within a schema.
The execute privilege is applied to enable usage of the function.
Storage credentials and external locations support three privileges.
Read files allows for direct reading of files, write files allows for direct modification of files, and create table allows a table to be created based on stored files.
Shares support select only.
Recipients, while treated by the metastore as data objects, function more like a principal in the Delta Sharing relationship, in that you grant privileges on a share to a recipient.
To recap: to access a table, you need the associated privilege on the table itself.
These include select to read from the table and modify to modify the data or
metadata using insert, delete, or Alter.
Table operations also require traversal of the catalog and schema, which in turn
requires usage on those.
This explicit chain of privileges improves security by reducing the likelihood of
unintended privilege escalation.
We all know that views sit in front of tables and can encapsulate some fairly complicated queries to make the system simpler, but they can also be helpful in protecting sensitive tables, since users who have access to the view do not need access to the underlying tables.
Access control with views is similar to tables, although only select is supported since views are read-only.
Since traversal of containers is still required, you will still need usage on
those.
And in order for the view to work properly, its owner must have appropriate privileges on the tables it queries.
To access a function, you need the execute privilege, along with usage on the containing schema and catalog.
We've seen that views can sit in front of a table and protect its underlying data.
Databricks extends view functionality to address some additional use cases that
provide sub-table access control at the level of columns and rows.
Dynamic views augment traditional views with Databricks-provided functions that target specific users or groups.
Applying these functions to conditionally omit or transform data within your view definitions allows you to achieve three important use cases.
You can hide columns for specific users or groups.
When querying the view, targeted users will not see values for the protected
columns.
You can omit records for specific users or groups.
When querying the view, targeted users will only see records that aren't caught by the filter criteria.
And finally, you can transform or partially obscure column values for specific users or groups.
When querying the view, targeted users will only see a transformed version of the
data for protected columns, for example, the domain name of an email address or the
last two digits of an account number.
Of course, one could achieve all of this by creating a secondary table based on the table we want to protect; however, this leads to duplication, increased complexity, and maintenance challenges.
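A minimal sketch of the column-hiding case, with hypothetical table, column, and group names, might look like this:

    CREATE OR REPLACE VIEW vw_customers AS
    SELECT
      CASE
        WHEN is_account_group_member('restricted_analysts') THEN 'REDACTED'  -- targeted group loses the value
        ELSE email
      END AS email,
      country,
      signup_date
    FROM customers;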
Creating a new table, view, or function requires create on the schema.
Again, usage is also needed on both the schema and the catalog.
Dropping an object can only be done by the owner, Metastore Admin, or an account
administrator.
Before we close, let's talk a little more about external storage in Unity Catalog.
As mentioned earlier, the security model for storage credentials and external
locations is defined by three privileges.
Create table, which allows for the creation of a table using that location, and
read files, and write files, which allows for direct reading or modifying of the
files in that location.
However, the exact meaning of those privileges is subtly different for storage credentials as compared to external locations.
Where does this difference come from? We can reference one storage credential from
many different external locations.
Consider a bucket containing many different directories. We can define an external
location for each of those directories, where the storage credential represents
access to the bucket itself.
Because we might have many external locations defined that use the same storage
credential, Databricks recommends managing access control through external
locations, since that will provide finer-grained access control.
Granting a privilege on a storage credential will implicitly grant that privilege
to any location accessible by that credential.
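To make that concrete (credential, location, and group names here are illustrative): granting on the credential exposes everything it can reach, while granting on an external location scopes access to one path.

    -- Broad: applies to everything reachable through the credential (the whole bucket)
    GRANT READ FILES ON STORAGE CREDENTIAL my_cred TO `data_engineers`;

    -- Finer-grained: applies only to the path covered by this external location
    GRANT READ FILES ON EXTERNAL LOCATION sales_raw TO `data_engineers`;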
And finally, let's look at recommendations when it comes to implementing data
architectures with Unity Catalog.
First, let's talk about best practices surrounding how and where you can set up
your Unity Catalog Metastores.
Unity Catalog allows one Metastore per region and only allows you to use that
Metastore in its assigned region.
If you have multiple regions using Databricks, you will need to create a Metastore
for each region.
You can't use a Metastore in different regions, as Unity Catalog offers a low
latency metadata layer.
Such configuration would impact query performance, and as such, it is not supported
or allowed.
Metastores can share data if needed, using Databricks to Databricks Delta Sharing,
a pattern which can also be applied across multiple clouds.
Applying this scheme, you can register tables from metastores in different regions.
In this example, we're sharing tables from region A with region B.
Keep in mind the following. Since Delta shares are read-only, tables will appear as
read-only in the consuming Metastore, or in this case, Metastore B.
Access control lists do not cross Delta sharing, so access control rules need to be
set up in the destination, in this case, Metastore B.
Apply this scheme sparingly for tables that are infrequently accessed, since you'll
be responsible for egress charges across cloud regions.
If you have frequently accessed data, it may make sense to copy it across regions
by a batch process, and then query locally, as opposed to querying across regions.
It may be tempting to concoct a different approach using external tables; that is, to create a storage credential in Metastore B that directly connects to the cloud storage referenced by the tables in Metastore A.
While this is possible, Databricks strongly advises against it. There is a risk here, in that any changes to table metadata in one metastore will not be propagated to the other metastore, leading to potential consistency issues.
With the basic rules of Metastores covered, some might be thinking these rules
might be limiting, particularly in terms of segregating your data.
Let's go into more detail on why that actually isn't the case, and how you can use
alternate constructs to achieve more desirable results.
As Metastores are limited to one per region, they are not a suitable construct for
segregating data.
There are a couple of other reasons why it would not be considered good practice to
segregate data using Metastores.
First, switching target metastores requires workspace reassignment, which messily spreads data governance tasks across several roles within your organization. Such a scheme would involve metastore administrators, account administrators, and potentially workspace administrators.
And second, Metastores are not actually physical containers. They are essentially a
thin layer that references a metadata repository and a cloud storage object.
Using Unity Catalog's container constructs, that is, schemas and catalogs, for
segregating your data enables all this to be handled by the Metastore
administrator.
With appropriate access rules in place, full isolation is provided, and Unity
Catalog's explicit privilege model minimizes the risk of implicit or accidental
privilege escalation.
This simple example shown here illustrates isolation by granting access only to
objects in Catalog B to group B, with a minimum of two privilege grants, usage on
Catalog B, and usage on any applicable schemas within Catalog B.
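In SQL, that isolation scheme amounts to just a couple of grants; the catalog, schema, and group names below are illustrative, and newer releases spell the usage privilege as USE CATALOG and USE SCHEMA:

    GRANT USAGE ON CATALOG catalog_b TO `group_b`;
    GRANT USAGE ON SCHEMA catalog_b.finance TO `group_b`;
    -- Object-level privileges, for example SELECT on specific tables, are then granted as needed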
Using catalogs to provide segregation across your organization's information architecture can be done in a number of different ways.
Examples include scoping by environment, like dev, staging, and prod, by team or business unit, or any combination of these things.
Catalogs can also be used to designate sandboxes, that is, an internal-only area to
create temporary data sets for internal use.
In short, use Catalogs to organize your data objects by environment scope, business
unit, or whatever combination of these you need.
Within each catalog remains a two-level namespace with which most analysts and SQL developers will be familiar.
Grant usage on the applicable catalogs only to those groups who should have access, and finally, place ownership of production catalogs and schemas on groups, not individuals.
Every good data governance story starts with a strong identity solution.
With Unity Catalog, we've elevated identities from individual Databricks workspaces
to a new parent component, the account, which has a one-to-one relationship with
your organization.
In order to write any ACL in Unity Catalog, or to provide any level of access for a principal to a securable object in Unity Catalog, that principal must live within the account.
In other words, workspace-only identities will not have access to data objects
through Unity Catalog.
To help with this, Databricks automatically promoted all workspace users and service principals to account-level principals in June of 2022.
Workspace-level groups were not included in this campaign due to potential
membership conflicts.
Moving forward, all organizations are strongly encouraged to manage identities at
the account level, which extends to identity provider integration.
In other words, identity providers, or the SCIM API, should not be used at the
workspace level at all.
With Identity Federation enabled, which is available when Unity Catalog is enabled
on workspaces, users can be easily assigned from accounts to workspaces, either by
account administrators through the account console or workspace administrators
through their workspace administration console.
It is best practice to use groups instead of users for assigning access and ownership to securable objects.
Groups should always be synchronized with the system they are managed in via the
SCIM API or through your identity provider.
It is best practice to use service principals to run production jobs.
This way, if the user is removed, there is no unforeseen impact to running jobs.
Furthermore, users should not run jobs that write into production.
This reduces the risk of a user overwriting production data by accident.
The flip side to this is that users should never be granted modify access to production tables.
This example represents the file system hierarchy of a single cloud storage
container.
Storage credentials provide the same access-level control as external locations, but apply it to the entire namespace.
It is normally desirable to exercise control over individual paths, which can only be accomplished using external locations.
Now that we have covered the concepts of Unity Catalog, let's actually see how Unity Catalog works in a real environment.
In this example, we have set up a Databricks classroom environment which is able to run Unity Catalog.
I will be transitioning between an account that will create and own my data, as well as assign grants, and another user who is part of the group that will be given access.
This user will be logged into the workspace, but will use the Databricks SQL editor in order to run queries.
You'll know which one I'm using when I'm taking a look at the upper right corner to
see either the Class 25 designation or the Class 0 0 designation.
Let's first start as the administrator of this workspace.
So in this environment, we already have a cluster running and attached to this notebook, and we've already run the setup scripts to establish the variables we'll need for this example.
These variables are specific to our training environment, so you won't have to worry about them in a real environment.
Just as a reminder, Unity Catalog utilizes a three-level namespace where we specify the catalog, then the schema, and then the table name.
Hive traditionally used only a two-level namespace, but since Unity Catalog supports multiple catalogs, that extra layer is added here.
Let's go ahead and create a new catalog and then set that as the default catalog, so we don't have to keep referencing it throughout the commands we're going to run in this notebook.
Next, we'll go ahead and create a schema using the Create schema if not exists
command.
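Roughly, the commands look like this; the catalog and schema names are stand-ins for the variables generated by the classroom setup script:

    CREATE CATALOG IF NOT EXISTS my_catalog;
    USE CATALOG my_catalog;            -- make it the default catalog for this session

    CREATE SCHEMA IF NOT EXISTS example;
    USE SCHEMA example;                -- make it the default schema as well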
And then we'll go ahead and set up a table and a view to play with in our
environment here.
Let's first start with the table.
We'll just use a simple CREATE OR REPLACE TABLE and then do an INSERT INTO to populate values for the table to be read from.
We will have a total of five rows of data that we're going to work with initially.
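A sketch of that setup follows; the column names and values are assumptions based on the description, not the real classroom data:

    CREATE OR REPLACE TABLE heart_rate_device
      (device_id INT, mrn STRING, name STRING, time TIMESTAMP, heartrate DOUBLE);

    INSERT INTO heart_rate_device VALUES
      (23, '40580129', 'Nicholas Spears', TIMESTAMP'2020-02-01 00:01:58', 54.0),
      (17, '52804177', 'Lynn Russell',    TIMESTAMP'2020-02-01 00:02:55', 92.5),
      (37, '65300842', 'Samuel Hughes',   TIMESTAMP'2020-02-01 00:08:58', 52.1),
      (23, '40580129', 'Nicholas Spears', TIMESTAMP'2020-02-01 00:16:51', 54.6),
      (17, '52804177', 'Lynn Russell',    TIMESTAMP'2020-02-01 00:18:08', 95.3);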
Next, let's create a view.
So we'll do a CREATE OR REPLACE VIEW and select specific pieces of data from the heart rate device table that we created, so that we'll have a view we can query as the end user to verify whether or not our permission settings make any difference.
Notice in this case, the view actually does an aggregation of the data.
So instead of five rows of data for each individual user, we actually have a total of three rows, giving us the average heart rate for each of the users in this example.
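The view might be defined along these lines, again using the assumed column names:

    CREATE OR REPLACE VIEW agg_heart_rate AS
    SELECT mrn, name, MEAN(heartrate) AS avg_heartrate, DATE_TRUNC('DD', time) AS date
    FROM heart_rate_device
    GROUP BY mrn, name, DATE_TRUNC('DD', time);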
The account users group has already been created for us.
This is the group that all of our users in this classroom have been assigned to.
So we're going to actually play with the ability to grant and revoke access using this particular account-level group.
As we mentioned earlier, best practice in Unity Catalog is to assign permissions at the group level rather than at the individual user level, which provides an easier way to maintain consistent security policies as users transition in and out of roles.
Let's go ahead and uncomment these lines and give permissions to our account users group.
We're going to grant usage on the catalog, usage on the schema, and select on the actual view we want them to use.
Remember that select on the object that we want them to view is not enough.
We do need to make sure that they have usage permissions in order to traverse the three-level namespace to get to the object they want to work with.
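Uncommented, those grants look something like this; the catalog and schema names are placeholders, and on newer releases the usage privilege is written USE CATALOG and USE SCHEMA:

    GRANT USAGE ON CATALOG my_catalog TO `account users`;
    GRANT USAGE ON SCHEMA my_catalog.example TO `account users`;
    GRANT SELECT ON VIEW my_catalog.example.agg_heart_rate TO `account users`;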
And now let's see what happens if this user tries to access this data.
We're going to go ahead and use this command to generate the select statement I
want the user to run.
This way, they will actually be able to use the same variables that we configured for this notebook.
All right, so let's take a command and paste it into the SQL editor of this user
that we're working with here.
So let's go ahead and add that in there.
Notice that in order to make sure that we are using the exact view that we want to
use, we're using the three level name space of catalog schema and table name.
All right, so let's go ahead and run that.
And we see that this user now can actually see the same view that we saw in the
other window.
Right, so we currently have not set up anything other than the ability to select
the data from this view.
So we see the exact same thing that the data owner did.
All right, good. So we've enabled access for my account users group.
And now we're going to start seeing how to restrict this access.
So next, before we get into some of the rules here, let's go ahead and also prove
that we can create and grant access to functions in addition to tables and views
within our environment.
So we're going to do a CREATE OR REPLACE FUNCTION called mask. It takes a string as input, returns a string, and concatenates on some characters so that we mask out everything except the last two characters of the data that we provide.
We run this and then do a select on it. We see that if we mask the term 'sensitive data', we get back a bunch of asterisks and the last two characters of 'sensitive data'.
In order to make sure that my user can use this, we're going to go ahead and grant execute on this function to my account users group.
We let that run, and there we go.
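A sketch of that function and grant, with the masking logic assumed from the description (hide everything but the last two characters):

    CREATE OR REPLACE FUNCTION mask(x STRING)
    RETURNS STRING
    RETURN CONCAT(REPEAT('*', LENGTH(x) - 2), RIGHT(x, 2));

    SELECT mask('sensitive data');   -- asterisks followed by the last two characters, 'ta'

    GRANT EXECUTE ON FUNCTION mask TO `account users`;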
And then let's verify that this works. So let's go ahead and generate a command to
run in my environment.
Opening new query window.
Let's go ahead and get this run.
All right, so there we get the same output that we did before. So we've proven that
the users in the account users group can run the same functions that we've given
them access to.
Next, let's start going into our tables and protecting the columns and rows with dynamic views.
Dynamic views can use some built-in, identity-related functions that make our lives easier.
First off is current_user, which returns my current user identity, so we can verify whether or not this user has access rights to the things we're looking at.
Next is is_account_group_member, which checks to see if the account that I'm logged in as is part of a group that has permissions assigned to it.
We'll see what this looks like as we go through here.
There's also another one called is_member, but we really don't recommend using it; it's pretty much deprecated at this point.
We would prefer that you use the more recent functions, current_user and is_account_group_member.
How do we use these? We're going to use them to redact columns.
So let's go ahead and take a look at this. We're going to rewrite the aggregate heart rate view as a new version.
In this case, we're going to use case statements to check for a specific condition: whether or not the user running this command is part of the account users group.
If so, we're going to replace the data with the word REDACTED; otherwise, they'll get the actual mrn data.
We'll do the exact same thing for the name column, redacting out the name if the user is part of this restricted group.
The rest is just standard SQL, so we won't worry too much about that. But let's go
ahead and see what happens.
So the view is now created, but if we update a view, we also have to update the grants.
So let's go ahead and reissue the select grant on the aggregate heart rate view to my account users group.
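A hedged reconstruction of that redacting view and the regrant, following the column sketch used earlier rather than the real classroom data:

    CREATE OR REPLACE VIEW agg_heart_rate AS
    SELECT
      CASE WHEN is_account_group_member('account users') THEN 'REDACTED' ELSE mrn  END AS mrn,
      CASE WHEN is_account_group_member('account users') THEN 'REDACTED' ELSE name END AS name,
      MEAN(heartrate) AS avg_heartrate,
      DATE_TRUNC('DD', time) AS date
    FROM heart_rate_device
    GROUP BY mrn, name, DATE_TRUNC('DD', time);

    -- Re-issue the grant after replacing the view, as noted above
    GRANT SELECT ON VIEW agg_heart_rate TO `account users`;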
Let's go look at the SQL editor for our regular user and see what happens.
I'm not going to change this query at all. I'm going to use the exact same query
just as we did before.
Where we are actually just doing a simple select on the every heart rate view.
Again, at this point, we've rewritten the every heart rate view to have the
reductions, and that's it.
We did that and we reissued the grants. From the user side, they don't necessarily
know anything's happened.
Well, at least not until they run the command anyways, because when I run the
command,
now we see that the mRN and a number of redacted no longer back to that information
because the view that we were assigned to use has changed the back end.
All right, that's pretty exciting. What else can we do?
We can still leverage this whole idea of using is_account_group_member in order to go ahead and filter rows as well.
So I can go ahead and say CREATE OR REPLACE VIEW aggregate heart rate AS SELECT mrn, time, device_id, heartrate.
That's all the same.
But now I'm going to say where in the case where this user is part of the account
users group,
then we're only going to let you see device IDs less than 30.
All right, so we'll change this.
We'll go ahead and reissue the grants.
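That row-filtering version might look like this, still over the assumed columns; members of the group only see device IDs under 30:

    CREATE OR REPLACE VIEW agg_heart_rate AS
    SELECT mrn, time, device_id, heartrate
    FROM heart_rate_device
    WHERE
      CASE WHEN is_account_group_member('account users') THEN device_id < 30
           ELSE TRUE
      END;

    GRANT SELECT ON VIEW agg_heart_rate TO `account users`;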
And then we'll switch back over to the SQL editor.
Right now notice that we're going to change the view.
Now instead of seeing the aggregate data, we're going to just see the raw data
instead.
But if I run this,
what we see are device IDs, but nothing greater than 30.
We actually went ahead and made that condition part of the view so that this user
is restricted to only see things below 30.
We can also do what's referred to as data masking.
Data masking is one of the reasons why we created that function earlier because I
wanted to be able to run a masking function as part of these checks.
So here we're going to rewrite the aggregate heart rate view one more time.
We're going to go ahead and check whether the user running this view is in the account users group.
If that is the case, then we're going to run the mask function on the mrn value rather than doing the full redaction.
You may need to do this for situations where you still need to run statistics on unique identifiers, but you can't give out the whole identifier.
This is that idea where people use, for example, maybe the last four digits of a social security number.
We need to have an identifier that identifies who this user is, but does not give access to their complete identity.
All right, so we'll mask the mrn in this case.
And we'll also combine this with the other filter that we set up based on group membership, to make sure that we don't see any device IDs above 30.
Once that's written, we'll go ahead and reissue the grants.
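Combining the mask function with the row filter, again as a sketch over the assumed columns:

    CREATE OR REPLACE VIEW agg_heart_rate AS
    SELECT
      CASE WHEN is_account_group_member('account users') THEN mask(mrn) ELSE mrn END AS mrn,
      time,
      device_id,
      heartrate
    FROM heart_rate_device
    WHERE
      CASE WHEN is_account_group_member('account users') THEN device_id < 30
           ELSE TRUE
      END;

    GRANT SELECT ON VIEW agg_heart_rate TO `account users`;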
Then we'll go back to our SQL editor one more time.
Again, not changing the actual select statement and then just run this.
And now we see that the mrn has been masked and we are still not seeing any device IDs greater than 30.
All right, so we can mix and match. We can be very, very creative with the way that
we are restricting data.
So we combine our restriction as well as dynamic masking in a single command.
So let's just take a moment to make sure that we understand where everything is.
If I do a show tables, we see that the aggregate heart rate and heart rate devices
objects both show up in the tables.
The show tables command is a little bit generic. It's going to identify both tables
and views, even though we specifically asked for tables.
If I want to be more selective, I can say show views, which will show me just the
views.
So notice in this case, I get the aggregate heart rate view, which reads from the heart rate device table; the view shows up, but the table itself does not.
I can also take a look at SHOW SCHEMAS, which will show me the schemas that are part of the catalog I'm currently using.
So we see that we have three schemas right now: default, example, and information_schema.
Example is the one we've been using for this demonstration.
And I can also see which catalogs I have access to.
So if I come in here and say SHOW CATALOGS, we can see the catalogs that I am currently allowed to see within my current workspace.
If I want to then say, all right, now that I know what those objects are, where they are, and what names they have, I can go ahead and do things like SHOW GRANTS.
So show me the grants on the aggregate heart rate view.
This shows me that account users can currently do select on this particular object, identified by catalog name, schema, and table name.
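The exploration commands from this part of the demo, in rough form and using the placeholder names from the sketches above:

    SHOW TABLES;                         -- lists both tables and views in the current schema
    SHOW VIEWS;                          -- lists only the views
    SHOW SCHEMAS;                        -- schemas in the current catalog
    SHOW CATALOGS;                       -- catalogs this principal is allowed to see
    SHOW GRANTS ON VIEW agg_heart_rate;  -- e.g. account users: SELECT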
I can also take a look at the grants on the heart rate device table. But remember, we didn't actually put any grants on the heart rate device table; we didn't have to give users direct access, and this is an important feature of Unity Catalog.
I can provide a view to a user without giving them grants on the table itself, so that they can never access the raw data.
They can only see the data that I give them access to through the view I've created and granted them access to.
Remember, even that access is restricted. For the schema that contains the table or view I want to give them access to, we have to make sure the usage grant is assigned to them in order for them to traverse that structure, and the same goes for the catalog.
So if we don't have use schema and use catalog granted, it won't matter what I grant at the object level, because access just won't be allowed.
So if you want to revoke access, we can do that as well. First, let's show the
grants on the function mask that we created earlier.
We see that the user's group has execute permissions on here.
I can go ahead then and revoke the execute permission from account users and verify that this actually took effect.
All right, so there are no more grants on the function mask. This is what we expected: we've revoked it, and now there's nothing there.
I can even revoke usage on the catalog. I'm using revoke usage to make that happen
as well.
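The revoke sequence, sketched with the same placeholder names:

    SHOW GRANTS ON FUNCTION mask;                      -- account users: EXECUTE

    REVOKE EXECUTE ON FUNCTION mask FROM `account users`;
    SHOW GRANTS ON FUNCTION mask;                      -- now returns nothing

    -- Pulling usage on the whole catalog cuts off access to everything inside it
    REVOKE USAGE ON CATALOG my_catalog FROM `account users`;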
But here's the interesting thing. Let's go back over to the other user's SQL editor and run the query again.
Notice that this user no longer has access to that catalog, so they can no longer get to the data contained within it.
This is a very good safety feature that helps us out here. If I'm really concerned, I can just revoke access to the catalog and then, at my leisure, go through and adjust the grants on the functions, views, and tables later on.
All right. So with that, that brings us to the end of our UC demo. We have lab
versions you can play around with as well. So if you have time, go ahead and try
that.
