Manage Data Access With Unity Catalog
This document provides an overview of Unity Catalog and how it addresses challenges with data governance. Unity Catalog provides a unified approach to governing data and AI assets across clouds and platforms through centralized access controls, auditing, and management of permissions. It integrates with the Databricks Lakehouse Platform to simplify data governance and improve security, access management, and visibility of data usage.
In this series of videos, we'll teach you key concepts about Unity Catalog,
including how it integrates with the Databricks platform.
How to access Unity Catalog through clusters and SQL warehouses. How to create and govern data assets in Unity Catalog. And finally, we'll review Databricks recommendations for your organization's Unity Catalog based solutions. Let's get started.

Before we dive into the specifics of Unity Catalog, let's take a moment to discuss the more general topic of data governance. 80% of organizations seeking to scale digital businesses will fail because they do not take a modern approach to data and analytics governance. Data governance refers to the process of managing the availability, usability, integrity, and security of data. Data governance has become extremely important in today's data-driven landscape, and it's needed to establish trustworthy data that is readily accessible to those who need it, in accordance with internally defined policies or external regulatory requirements.

When we talk about data governance, we can generally divide the discussion into four main areas. First is data access control. Organizations need to be able to lock down data and data-generating artifacts such as files, tables, and machine learning models. Permission to access should be granted only to those who need it. Second is data access auditing. Any data governance program requires knowledge of how data is being used. That doesn't just mean who is accessing the data, but when and how it's being used. Third is data lineage. The journey your data takes from its origin as it moves through your pipelines is referred to as data lineage, and being able to trace it is beneficial in a number of ways. First, it's much easier to identify root causes of processing issues. Second, it enables you to predict the impact of proposed upstream changes to your processing. Third, it helps compliance teams prove that data and reports come from trusted and verifiable sources. And finally, it fuels better understanding of the data for data analysts, scientists, and engineers. The fourth area is data discovery. Discovering data is always a challenge, particularly in a data lake, but a data governance program requires a readily accessible inventory of data assets that can be easily searched.

Since the emergence of the data lake paradigm, data governance in this context has been complex, perhaps unnecessarily so. Many organizations employ a data lake alongside a data warehouse, which leads to data duplication and hence data drift. Governance happens at two different levels across these platforms, both of which often employ fundamentally different tool stacks. This makes it hard to collaborate and control things consistently. Data and AI governance today is complex: organizations are trying to serve all of these different types of technology, and this creates a really complex environment for governance.

In a typical enterprise today, you have a lot of data stored in data lakes, for example, AWS S3. To control permissions on data in the data lake, you set permissions on files and directories. This means you cannot set fine-grained permissions on rows and columns. Because governance controls are at the file level, data teams must carefully structure their data layout to support the desired policies. For example, a team might partition data into different directories by country and give access to each directory to different groups. But what should the team do when governance rules change?
If different states inside one country adopt different data regulations, the organization may need to restructure all its directories and files, so it becomes complex to implement policy changes. With most data lakes, you not only have files, but you also have metadata. For example, you might have a Hive metastore that keeps track of table definitions and views, so you also have to set permissions on tables and views. These permissions can go out of sync with the underlying data: there's no guarantee that if users have access permissions on files, they will have permission on the corresponding table, or vice versa. It becomes very confusing to manage these permissions. Then you might have your data warehouse, where the permissions are more fine-grained on tables, columns, and views. But again, it's a different governance model. In a typical scenario, you have data movement between your data lakes and data warehouses, and now you've created data silos with data movement across two systems, each with a different governance model. As a result, your governance is inconsistent and prone to errors, making it difficult to manage permissions, conduct audits, or discover and share data.

But data isn't limited to files or tables. You also have assets like dashboards, machine learning models, and notebooks, each with its own permission model and tech stack, making it difficult to manage access permissions for all these assets consistently. The problem gets bigger when your data assets exist across multiple clouds with different access management solutions. What you need is a unified approach to simplifying data governance for data and AI.

To address these challenges, we created Unity Catalog, which provides a unified governance layer for all data and AI assets in your lakehouse. Unity Catalog provides a single interface to manage permissions and auditing for your data and AI assets. So how does Unity Catalog rise to these challenges? By integrating a centralized hub within the Databricks Lakehouse Platform for administering, securing, and auditing your data.

You can unify governance across clouds. Unity Catalog blends in with the Databricks Lakehouse Platform and allows you to define your data access rules once, where they can be applied across multiple workspaces, clouds, languages, and use cases, governing all your data assets in a consistent way through easy-to-use user interfaces and SQL APIs. The access controls available go far beyond what any cloud system can provide. Unity Catalog provides access control for all managed data assets, including files, tables, rows and columns within tables, and views. While Databricks has provided some degree of access control in the past, its security model is permissive by default and required careful administration of the access control lists and the compute resources accessing the data to yield a fully secure solution. Those access control rules are defined as a property of a workspace, so reliably scaling out to a multi-workspace or multi-cloud environment is a real challenge. Unity Catalog fixes all that. It lives outside the workspace and therefore spans workspaces and clouds. It's secure by default and doesn't rely on the cooperation of compute resources to enforce access control.

You can also unify data and AI assets. While your data is free to roam, Unity Catalog can manage it all from a central hub in a consistent and familiar way. There's no need to replicate or translate your security requirements across different systems.
If rules change, those changes only need to be applied once. Unity Catalog also brings with it the ability to do fine-grained auditing of all queries performed, to support data governance use cases. Data lineage will also be collected and visualized for tables and columns across all languages. And finally, you can unify existing catalogs. Unity Catalog is additive and works seamlessly with other metastores, including the legacy Hive metastore that comes with each Databricks workspace. It's a simple operation that can be done whenever you like.

Now that we understand some of the benefits Unity Catalog brings, let's look at some of the key elements that are important to understanding how Unity Catalog works. First is the metastore. The metastore is the top-level logical container in Unity Catalog. It's a construct that represents the metadata, that is, information about the objects being managed by the metastore, as well as the access control lists that govern access to those objects. Don't confuse the Unity Catalog metastore with the Hive metastore. The Hive metastore, which will be familiar to existing Databricks users, is the traditional default metastore linked to each Databricks workspace. And while it may seem functionally similar to a Unity Catalog metastore, Unity Catalog metastores offer an improved security model, better auditing capabilities, and a host of other features.

It's best to think of a metastore as a logical construct for organizing your data and its associated metadata, rather than as a physical container itself. This theme is reflected in the physical makeup of a metastore, and accepting this idea will also make some of our best practices more sensible, which we'll cover later in this lesson. The metastore essentially functions as a collection of metadata plus a link to a cloud storage container. The metadata, that is, information about the data objects (a table's columns and data types, for example), as well as the access control lists for those objects, is stored in the control plane. This ties a metastore to a cloud region, which is why account administrators are prompted for a region when creating a metastore. Data-related objects managed by the metastore are stored in a cloud storage container, which is also configured as part of the metastore setup. We won't dive much further into these details at the present time, though we'll revisit these concepts throughout the lesson where they apply.

An important property of Unity Catalog is that it is additive and does not preclude access to your existing data objects stored in your local metastore. Once a workspace is connected to Unity Catalog through assignment of a Unity Catalog metastore, Unity Catalog presents the Hive metastore as a special catalog named hive_metastore. Assets within the Hive metastore can be seamlessly referenced by specifying hive_metastore as the first part of the three-level namespace, or, if it is set as the default catalog, two-level references work as they always have. Of course, Unity Catalog will not enforce any access control on assets stored there, though legacy table access control lists are supported in some compute configurations, and we'll talk more about this later.
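For instance, referencing legacy Hive metastore assets might look like the following minimal sketch (the schema and table names here are hypothetical):

    -- Reference a legacy table through the special hive_metastore catalog
    SELECT * FROM hive_metastore.default.legacy_sales;

    -- Or make hive_metastore the default catalog so two-level references keep working
    USE CATALOG hive_metastore;
    SELECT * FROM default.legacy_sales;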
With a high-level understanding of what a metastore is, let's now drill into the data object hierarchy, starting with the catalog. A catalog is the topmost container for data objects in Unity Catalog and forms the first part of the three-level namespace that we'll see again and again throughout this module. Before we go any further, let's take a quick moment to talk more about the three-level namespace. Existing Databricks users, and for that matter anyone familiar with SQL, understand the traditional two-level namespace used to address tables within schemas. Unity Catalog introduces a third level to provide improved data segregation capabilities; correspondingly, complete SQL references in Unity Catalog use three levels. Getting back to the catalog: metastores can have as many catalogs as desired. Catalogs are containers, but they can only contain schemas, which we'll talk about next.

The concept of a schema, which is sometimes referred to as a database, is part of traditional SQL and is unchanged by Unity Catalog. It functions as a container for data-bearing assets like tables and views, and forms the second part of the three-level namespace we just saw. Catalogs can contain as many schemas as desired, which in turn can contain as many data objects as desired.

At the bottom layer of the hierarchy, let's first talk about tables. Existing Databricks users and SQL developers understand tables well. These are SQL relations consisting of an ordered list of columns, and the overall concept is unchanged by Unity Catalog. Tables do have variations in Databricks, though, and to fully understand these, it's important to recognize that tables are fully defined by two distinct elements. First, the metadata, or the information about the table, including the list of columns and their associated data types. And then we have the data that populates the rows of the table, originating from data files stored in cloud object storage. With these elements in mind, let's now talk about the two table variations that existing Databricks users will likely be familiar with: managed and external tables. In both cases, the table metadata is managed by the metastore in the control plane, and dropping the table always means discarding the metadata relating to the table. The difference between the variations amounts to where the table data resides. In the case of a managed table, data files are stored in the managed storage location, that is, the cloud storage container backing the metastore that we mentioned previously. With an external table, data files are stored in some other cloud storage location supplied by the user. The ability to completely decouple the data like this is important in some use cases and for many of our users. Otherwise, operating on managed versus external tables is largely similar, with the notable exception of dropping the table. In the case of an external table, the underlying table data is left intact; only the table metadata is discarded. If you were to recreate a table using the same location, the table data will appear in the state it was in when the table was dropped. Dropping a managed table, on the other hand, discards both the metadata and the data.

Next, views. Existing Databricks users and SQL developers will also likely understand views well. Views are essentially stored queries that are executed when you query the view. Views perform arbitrary SQL transformations of tables or other views, and they are read-only: they do not have the ability to modify the underlying data.
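Pulling the hierarchy together, here is a minimal sketch of creating a catalog, a schema, and the two table variations. All names and the storage path are hypothetical, and the external table assumes an external location already governs that path:

    -- Catalog and schema containers
    CREATE CATALOG IF NOT EXISTS sales_catalog;
    CREATE SCHEMA IF NOT EXISTS sales_catalog.retail;

    -- Managed table: data files live in the metastore's managed storage location
    CREATE TABLE sales_catalog.retail.orders (id INT, amount DOUBLE);

    -- External table: data files live in user-supplied cloud storage
    CREATE TABLE sales_catalog.retail.orders_ext (id INT, amount DOUBLE)
    LOCATION 's3://some-bucket/retail/orders';

    -- Dropping either table discards its metadata; only the managed table's data
    -- files are also removed. Dropping the external table leaves the files intact.
    DROP TABLE sales_catalog.retail.orders_ext;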
The final elements in the data object hierarchy are user-defined functions, or UDFs. UDFs enable you to encapsulate custom functionality into a function that can be invoked within queries.

We also have storage credentials and external locations; we'll talk about storage credentials first. Since all data lives in the cloud, Unity Catalog needs a way to authenticate with cloud storage containers. This is true whether you're using the default storage location configured as part of a metastore or arbitrary user-supplied cloud storage. The storage credential satisfies the requirement of encapsulating an authentication method to access cloud storage. Each metastore can access external cloud storage to support external tables or file access. While storage credentials fulfill a critically important requirement, they're not very convenient in terms of access control when dealing with external cloud storage: storage credentials apply to the entire cloud storage container they reference, therefore privileges granted on them apply to the entire container. It's typically desirable to achieve finer-grained control. External locations build on the concept of a storage credential and extend it with a storage path within the cloud storage container, allowing users to arbitrarily subdivide containers into smaller pieces and exercise control over each subdivision. External locations can be used to support and manage external tables. They can also be used to govern direct access to files stored in cloud storage.

And finally, we'll talk about shares and recipients. Shares and recipients relate to Delta Sharing, an open protocol developed by Databricks for secure, low-overhead data sharing across organizations. It's intrinsically built into Unity Catalog and is used to explicitly declare shares, that is, read-only logical collections of tables. These can be shared with one or more recipients, that is, data readers outside the organization. We mention these constructs here since they are part of the Unity Catalog security model, and though Delta Sharing will be mentioned here and there, we won't put a heavy focus on it in this module. Databricks Academy offers a separate training focused on Delta Sharing.
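As a quick illustration of how the two external storage constructs relate, here is a minimal sketch of defining an external location on top of an existing storage credential; the names and path are hypothetical, and the credential itself is assumed to have been created already by an administrator:

    -- The storage credential authenticates to the container as a whole; the
    -- external location narrows that down to one path inside the container
    CREATE EXTERNAL LOCATION IF NOT EXISTS retail_landing
      URL 's3://some-bucket/retail/landing'
      WITH (STORAGE CREDENTIAL my_storage_credential);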
Let's now talk about the Unity Catalog architecture. How does Unity Catalog fit into the Databricks landscape? Well, let's talk about life before Unity Catalog. On the left, we see an overview of how things looked prior to the introduction of Unity Catalog. Prior to Unity Catalog, user management and data management were a function of the workspace. Users and groups were defined within a particular workspace, either manually or automatically, through a connection to an identity provider of some sort. The key point here is that users and groups were tied to a single workspace, with no intrinsic synchronization in a multi-workspace environment. Access control was provided through a cooperative effort between a metastore, that is, the repository that stores information about your data, and the compute resources accessing the data, that is, your clusters and SQL warehouses. By default, this metastore was furnished by a Hive metastore local to each workspace, although integrating an external metastore was also supported. One key element to note here is the locality of the metastore, which makes it challenging to easily and consistently apply the same set of security controls across multiple workspaces. Furthermore, full security requires cooperation between compute resources and the metastore: your clusters must be properly configured in order for access control rules to be enforced. If compute resources are not appropriately configured, or users are allowed unrestricted configuration of their own clusters, then access rules can be bypassed.

By contrast, Unity Catalog, with its enhanced security model, sits on its own and factors the security-related elements out of the workspace, delivering the define-once, secure-everywhere principle we discussed earlier. In this architecture, users and groups for Unity Catalog are managed through the account console, manually or through an identity provider, and then assigned to one or more workspaces. Metastores are likewise factored out of the workspace and managed through the account console, where they can be assigned to workspaces. Any one metastore can be assigned to more than one workspace, enabling multiple workspaces to share the same access control lists. Compute resources that can connect to Unity Catalog will, by default, be subject to Unity Catalog's security constraints; no specific configuration or administration is needed to make the system secure. Furthermore, any changes to security policies defined in a metastore are automatically and immediately propagated to all assigned workspaces and their associated clusters or SQL warehouses.

Having discussed all the key concepts related to Unity Catalog, let's take a final look at its security model in action. Let's begin by touring the lifecycle of a query to see how Unity Catalog provides access control in a secure yet performant way. The story begins with a principal issuing a query. Queries can be issued through all-purpose clusters, for cases when users are running Python or SQL workloads interactively. In the case of a job or pipeline running as a service principal, the query would typically run through a job cluster. Alternatively, data analysts may issue queries in Databricks SQL through a SQL warehouse, or the query could originate from a BI tool connected to a SQL warehouse. In any case, the applicable compute resource begins processing the query. Next, the request is dispatched to Unity Catalog, which in turn logs the request and validates the query against all security constraints defined within the metastore to which the compute resource is associated. For each object referenced in the query, Unity Catalog assumes the appropriate cloud credential governing that object, as provided by a cloud administrator. For managed tables, this could be the cloud storage associated with the metastore; for files or external tables, this would be an external location governed by a storage credential. Again, for each object referenced in the query, Unity Catalog generates a scoped temporary token to enable the client to access the data directly from storage, and returns that token along with an access URL. That allows the cluster or SQL warehouse to access data directly but securely. The cluster or SQL warehouse requests data directly from cloud storage using the URL and token passed back from Unity Catalog, and data is transferred back from cloud storage. This request process is repeated for each object referenced by the query. Since access to the data is at the partition level, last-mile row- and column-based filtering is applied on the cluster or SQL warehouse. And finally, filtered results are passed back to the caller.

In this module, we'll discuss considerations related to connecting the two main Databricks compute resources, clusters and SQL warehouses, to Unity Catalog.
Next, let's talk about compute resources in Unity Catalog. The main distinction Unity Catalog brings to cluster configuration is the introduction of the access mode parameter, so let's talk about that now. All users, regardless of whether or not they have used Databricks before, should become familiar with the new access mode parameter and its available options. For existing Databricks users, the introduction of this new parameter alleviates the previous overloading of the cluster mode parameter, thus eliminating the high concurrency cluster mode. While there are three access mode options to choose from, only two are relevant as far as Unity Catalog is concerned: single user and shared.

As the name suggests, clusters using the single user mode can only be used by a single user, who is designated when creating the cluster. The designated user can also be edited after the fact. The main point here is that this setting is independent of any other cluster access control provided by the workspace, and only the designated user can attach to the cluster. The main advantages of single user clusters are their language support and their support for features that could otherwise compromise the environment of a shared cluster. These features include init scripts, library installation, and DBFS FUSE mounts. Note, however, that dynamic views, which we'll talk about later, are currently not supported on single user clusters.

Shared clusters can be shared by multiple users, but only Python and SQL workloads are allowed. Some advanced cluster features, such as library installation, init scripts, and DBFS FUSE mounts, are disabled to ensure security isolation among users of the cluster. Notebook-level library installations will still work, but cluster-level installations will not. Do not use the remaining access mode, titled no isolation shared, if you are running workloads that need to access Unity Catalog.

We've provided a matrix for reference to aid in selecting an appropriate access mode, with the shaded rows representing modes that support Unity Catalog access. Your choice will be primarily driven by the type of workloads you want to run. Single user clusters are best for automated jobs or general-purpose work when you need features that user isolation doesn't support. Remember that this mode requires a cluster for each user, since only the designated user can attach. For interactive development that relies on Python or SQL, you'll likely want to choose shared, unless you need advanced features not supported in that mode. It's also important to note that dynamic views, which we'll discuss in detail later, are only supported in this mode. If you need the row- or column-level protection that dynamic views offer, then you need to choose this option.

Let's now talk about roles and identities in Unity Catalog. Cloud administrators have the ability to administer and control the cloud resources that Unity Catalog leverages. These entities have different names depending on which cloud you're working with, but they include storage accounts or buckets, and IAM roles, service principals, or managed identities. As it relates to Unity Catalog, cloud administrators are involved in setting up the resources needed to support new metastores. They're responsible for creating a cloud storage container and setting it up so that Databricks can access the container in a secure manner. The detailed responsibilities and tasks associated with this role are cloud-specific and outside the scope of this lesson.
Identity administrators have the ability to administer users and groups in the identity provider, when one is in use. The identity provider provisions identities at the account level through SCIM connectors, so that identities do not need to be created and managed manually. This setup is performed and maintained by identity administrators and is dependent on the identity provider in use. The exact responsibilities and tasks associated with this role are also outside the scope of this training.

Account administrators have the ability to administer and control anything at the account level. As far as Unity Catalog is concerned, this includes the following tasks: creating and managing metastores based on resources set up by a cloud administrator; assigning metastores to workspaces; and managing users and groups, or setting up integration with an identity provider for automated identity management, a task done in cooperation with an identity administrator. Account administrators also have full access to all data objects, with the added ability to grant privileges and change ownership. The initial account administrator is designated when setting up Databricks for the first time in your organization. Regular users can be elevated to account administrator by an existing account administrator through the setting of an attribute in the user's profile.

Every metastore has an administrator too, who by default is the account administrator that created the metastore. The metastore administrator has the ability to create catalogs and other data objects within the metastore they own. They have full access to all data objects within the metastore, with the added ability to grant privileges and change ownership. Essentially, they have the same abilities as account administrators, but only within the metastore they own, whereas account administrators have those abilities over all metastores. The metastore administrator can be changed by the current metastore administrator or an account administrator. It's best practice to designate a group as metastore admin rather than an individual; we'll go into more detail on this shortly.

Finally, each data object also has an owner, who by default is the principal who created the object. Data owners have the ability to perform grants on data objects they own, as well as create new nested objects. For example, a schema owner can create a table within that schema. Owners also have the ability to grant privileges and change ownership of the objects they own. Data ownership can be changed by the owner, the metastore administrator, or any account administrator.

Though this final role has little to do with data governance, we'll talk about it quickly to round out the big picture. Workspace administrators have the ability to perform administrative tasks on specific workspaces. These tasks include administering permissions on assets and compute resources within the workspace, defining cluster policies that limit users' ability to create their own clusters, adding or removing user assignments, elevating user permissions within a workspace, and changing job ownership. Workspace administrators can be designated by account administrators when assigning users to the workspace, though a workspace administrator can also elevate a user within the workspace they administer.

A user in Databricks corresponds to an individual physical user of the system, that is, a person.
Users authenticate using their email and password and interact with the platform through the user interface. Users can also access functionality through command-line tools and REST APIs. User identities in Databricks are fairly straightforward: users are uniquely identified by email address and can carry additional information, at a minimum a first and last name, to make identities more readable. As we mentioned previously, account administrators have the ability to perform several administrative tasks important to Unity Catalog, such as managing and assigning metastores to workspaces and managing other users. The initial account administrator is designated when setting up Databricks for the first time, but other users can be elevated by enabling the admin role in their user profile.

A service principal is an individual identity for use with automated tools, jobs, and applications. While service principals are assigned a name by their creator, they are uniquely identified by a globally unique identifier, commonly referred to as a GUID, that is assigned by the platform when the service principal is created. Service principals authenticate with the platform using an access token. They access functionality through the APIs, or can run workloads using Databricks Workflows. Like users, service principals can be elevated to have administrator privileges, which allows them to programmatically carry out any of the account management tasks that a user with this role can perform.

Groups are a somewhat universal construct in any governance scheme and are used to gather individual users into a composite unit to simplify management. In the context of Databricks, groups collect users and service principals into a single entity to achieve this simplification. Any grants given to the group are automatically inherited by all members of the group. Beyond the built-in Databricks roles we talked about earlier, most data governance programs employ custom roles that define who can access what, and how, within the organization. Groups provide a user management construct that cleanly maps to such roles, simplifying the implementation of data governance policies. In this way, permissions can be granted to groups in accordance with your organization's security policies, and users can be added to groups in accordance with their roles within the organization. It's inevitable that users will transition between roles; when that happens, it's trivial to move or copy users from one group to another. Likewise, as your governance model evolves and role definitions change, it's easy to effect those changes on groups. Making changes like this is significantly more labor-intensive and error-prone if the system is built with permissions hardwired at the individual user level. Groups can also be nested within other groups if your security model calls for that. In this instance, the outer group, all users, is referred to as a parent group of the inner two groups, analysts and developers, and the inner groups automatically inherit all grants given to the parent group.

Recall that identities for Unity Catalog are managed through the account console. However, identities still exist distinctly in the workspaces to which they're assigned. Though account and workspace identities are distinct, they are linked by the identifying information common to the two: for users, this is the email address, and for service principals, this is the globally unique identifier.
Thus, it's important when propagating an account-level user to a workspace that their email address, as recorded in their workspace identity, matches exactly. Otherwise, users running workloads will run into trouble as soon as they try to access any data objects through Unity Catalog, even though they might be able to log in to their assigned workspace. For this reason, workspaces support a feature called identity federation, which simplifies the maintenance of account identities across one or more workspaces. Identity federation alleviates the need to manually create and maintain copies of identities at the workspace level. Identities, that is, users, service principals, groups, and nested groups, are created once in the account console. Then, they can be assigned to one or more workspaces as needed. Assignment is a simple operation that can be done from the account console by an account administrator, or from the workspace by a workspace administrator. Though workspace identities can still be managed directly from within individual workspaces by a workspace administrator, Databricks discourages this practice and instead encourages the management of identities at the account level.

We've talked about the Unity Catalog security model at a high level. Let's take a deeper dive into how that security model applies to data objects like tables and views. Catalogs and schemas, both containers, support two privileges: CREATE and USAGE. These form the foundation of Unity Catalog's explicit permission model, which means that all permissions must be explicitly granted and are not implied or inherited. CREATE allows a grantee to create child objects. In the case of a catalog, this means the ability to create schemas within the catalog; for schemas, this means the ability to create data objects like tables, views, and functions. As for creating catalogs themselves, only the metastore admin can do that. USAGE allows the grantee to traverse the container in order to access child objects. To access a table, for example, you need USAGE on the containing schema and catalog, as well as appropriate privileges on the table itself. Without this entire chain of grants in place, access will not be permitted. To reiterate this important point: privileges are not automatically inherited by child objects. Granting CREATE on a catalog, for example, does not automatically propagate that privilege to the schemas within that catalog.

Tables, both external and managed, support SELECT and MODIFY. SELECT allows querying of the table, while MODIFY allows modification of the table data (through operations like INSERT or UPDATE) or of its metadata (through ALTER). As a reminder, external and managed tables are treated the same from an access control perspective. Views, that is, read-only queries against one or more tables that are executed when you query the view, support SELECT. One subtle but important property of views is that when a view is queried, the query is treated as though it were running as the owner of the view. This means that users do not need access to the underlying source tables to access a view, which gives you the ability to protect tables using views.

User-defined functions allow you to augment the comprehensive suite of available SQL functions with custom code of your own; they are managed like other data objects within a schema, and the EXECUTE privilege is applied to enable usage of the function.
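To make the privilege chain concrete, here is a minimal sketch of the grants a group would need to query a single table; the catalog, schema, table, and group names are hypothetical:

    -- USAGE on the containing catalog and schema lets the grantee traverse the hierarchy
    GRANT USAGE ON CATALOG sales_catalog TO `analysts`;
    GRANT USAGE ON SCHEMA sales_catalog.retail TO `analysts`;

    -- Plus the object-level privilege on the table itself
    GRANT SELECT ON TABLE sales_catalog.retail.orders TO `analysts`;

    -- If any link in this chain is missing, the query is denied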
Storage credentials and external locations support three privileges: READ FILES allows for direct reading of files, WRITE FILES allows for direct modification of files, and CREATE TABLE allows a table to be created based on stored files. Shares support SELECT only. Recipients, while treated by the metastore as data objects, function more like principals in the Delta Sharing relationship, in that you grant privileges on a share to a recipient.

To recap: to access a table, you need the associated privilege on the table itself. These include SELECT to read from the table and MODIFY to modify the data or metadata using INSERT, DELETE, or ALTER. Table operations also require traversal of the catalog and schema, which in turn requires USAGE on those. This explicit chain of privileges improves security by reducing the likelihood of unintended privilege escalation. We all know that views sit in front of tables and can encapsulate some fairly complicated queries to make the system simpler, but they can also be helpful in protecting sensitive tables, since users who have access to the view do not need access to the underlying tables. Access control with views is similar to tables, although only SELECT is supported, since views are read-only. Since traversal of containers is still required, you will still need USAGE on the containing schema and catalog, and in order for the view to work properly, its owner must have appropriate privileges on the tables the view queries. To access a function, you need EXECUTE, along with USAGE on the containing schema and catalog.

We've seen that views can sit in front of a table and protect its underlying data. Databricks extends view functionality to address some additional use cases that provide sub-table access control at the level of columns and rows. Dynamic views augment traditional views with Databricks-provided functions that target specific users or groups. Applying these functions to conditionally omit or transform data within your view definitions allows you to achieve three important use cases. You can hide columns for specific users or groups: when querying the view, targeted users will not see values for the protected columns. You can omit records for specific users or groups: when querying the view, targeted users will only see records that are not caught by the filter criteria. And finally, you can transform or partially obscure column values for specific users or groups: when querying the view, targeted users will only see a transformed version of the data for protected columns, for example, the domain name of an email address or the last two digits of an account number. Of course, one could achieve all this by creating a secondary table based on the table we want to protect; however, this leads to duplication, increased complexity, and maintenance challenges.

Creating a new table, view, or function requires CREATE on the schema. Again, USAGE is also needed on both the schema and the catalog. Dropping an object can only be done by the owner, the metastore admin, or an account administrator.

Before we close, let's talk a little more about external storage in Unity Catalog. As mentioned earlier, the security model for storage credentials and external locations is defined by three privileges: CREATE TABLE, which allows for the creation of a table using that location, and READ FILES and WRITE FILES, which allow for direct reading or modification of the files in that location. However, the exact meaning of those privileges is subtly different for storage credentials as compared to external locations. Where does this difference come from? We can reference one storage credential from many different external locations. Consider a bucket containing many different directories: we can define an external location for each of those directories, where the storage credential represents access to the bucket itself. Because we might have many external locations defined that use the same storage credential, Databricks recommends managing access control through external locations, since that provides finer-grained access control. Granting a privilege on a storage credential implicitly grants that privilege to any location accessible by that credential.
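As a sketch of the recommended pattern, grants would be issued against an external location rather than its storage credential; the location and group names are hypothetical, and the CREATE TABLE privilege is named as in this lesson (newer releases may call it CREATE EXTERNAL TABLE):

    -- Granting on the external location scopes access to one path; granting the same
    -- privileges on the storage credential would implicitly cover every location
    -- reachable through that credential
    GRANT READ FILES ON EXTERNAL LOCATION retail_landing TO `data_engineers`;
    GRANT WRITE FILES ON EXTERNAL LOCATION retail_landing TO `data_engineers`;
    GRANT CREATE TABLE ON EXTERNAL LOCATION retail_landing TO `data_engineers`;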
And finally, let's look at recommendations when it comes to implementing data architectures with Unity Catalog. First, let's talk about best practices surrounding how and where you can set up your Unity Catalog metastores. Unity Catalog allows one metastore per region and only allows you to use that metastore in its assigned region. If you have multiple regions using Databricks, you will need to create a metastore for each region. You can't use a metastore across regions: Unity Catalog is meant to offer a low-latency metadata layer, such a configuration would impact query performance, and as such it is not supported or allowed.

Metastores can share data if needed, using Databricks-to-Databricks Delta Sharing, a pattern which can also be applied across multiple clouds. Applying this scheme, you can register tables from metastores in different regions. In this example, we're sharing tables from region A with region B. Keep in mind the following. Since Delta shares are read-only, tables will appear as read-only in the consuming metastore, in this case metastore B. Access control lists do not carry across Delta Sharing, so access control rules need to be set up in the destination, in this case metastore B. Apply this scheme sparingly, for tables that are infrequently accessed, since you'll be responsible for egress charges across cloud regions. If you have frequently accessed data, it may make sense to copy it across regions via a batch process and then query it locally, as opposed to querying across regions. It may be tempting to concoct a different approach using external tables, that is, to create a storage credential in metastore B that directly connects to the cloud storage referenced by the tables in metastore A. While this is possible, Databricks strongly advises against it. There is a risk here in that any changes to table metadata in one metastore will not be propagated to the other metastore, leading to potential consistency issues.
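A rough sketch of what Databricks-to-Databricks sharing from metastore A to metastore B could look like in SQL; the names are hypothetical and the sharing identifier is a placeholder obtained from the receiving metastore:

    -- In metastore A: declare a share and add the table(s) to publish
    CREATE SHARE IF NOT EXISTS region_a_share;
    ALTER SHARE region_a_share ADD TABLE sales_catalog.retail.orders;

    -- Create a recipient using metastore B's sharing identifier, then grant access
    CREATE RECIPIENT IF NOT EXISTS region_b USING ID '<metastore-b-sharing-identifier>';
    GRANT SELECT ON SHARE region_a_share TO RECIPIENT region_b;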
With the basic rules of metastores covered, some might be thinking these rules are limiting, particularly in terms of segregating your data. Let's go into more detail on why that actually isn't the case, and how you can use alternate constructs to achieve more desirable results. As metastores are limited to one per region, they are not a suitable construct for segregating data. There are a couple of other reasons why it would not be considered good practice to segregate data using metastores. First, switching target metastores requires workspace reassignment, which messily spreads data governance tasks across several roles within your organization; such a scheme would involve metastore administrators, account administrators, and potentially workspace administrators. And second, metastores are not actually physical containers; they are essentially a thin layer that references a metadata repository and a cloud storage location.

Using Unity Catalog's container constructs, that is, schemas and catalogs, to segregate your data enables all of this to be handled by the metastore administrator. With appropriate access rules in place, full isolation is provided, and Unity Catalog's explicit privilege model minimizes the risk of implicit or accidental privilege escalation. The simple example shown here illustrates isolation by granting group B access only to objects in catalog B, with a minimum of two privilege grants: USAGE on catalog B, and USAGE on any applicable schemas within catalog B. Using catalogs to provide segregation across your organization's information architecture can be done in a number of different ways. Examples include scoping by environment (dev, staging, and prod), by team or business unit, or any combination of these. Catalogs can also be used to designate sandboxes, that is, internal-only areas for creating temporary data sets. In short, use catalogs to organize your data objects by environment scope, business unit, or whatever combination of these you need. Within each catalog remains a two-level namespace with which most analysts and SQL developers will be familiar. Grant USAGE on applicable catalogs only to those groups who should have access, and finally, place ownership of production catalogs and schemas on groups, not individuals.

Every good data governance story starts with a strong identity solution. With Unity Catalog, we've elevated identities from individual Databricks workspaces to a new parent component, the account, which has a one-to-one relationship with your organization. In order to write any ACL in Unity Catalog, or to provide a principal any level of access to a securable object in Unity Catalog, that principal must live within the account. In other words, workspace-only identities will not have access to data objects through Unity Catalog. To help with this, Databricks automatically promoted all workspace users and service principals to account-level principals in June of 2022. Workspace-level groups were not included in this migration due to potential membership conflicts. Moving forward, all organizations are strongly encouraged to manage identities at the account level, which extends to identity provider integration. In other words, identity providers, or the SCIM API, should not be used at the workspace level at all. With identity federation enabled, which is available when Unity Catalog is enabled on workspaces, users can easily be assigned from accounts to workspaces, either by account administrators through the account console or by workspace administrators through their workspace administration console.

It is best practice to use groups instead of users for assigning access and ownership to securable objects. Groups should always be synchronized with the system they are managed in, via the SCIM API or through your identity provider. It is best practice to use service principals to run production jobs; this way, if a user is removed, there is no unforeseen impact to running jobs. Furthermore, users should not run jobs that write into production, which reduces the risk of a user overwriting production data by accident. The flip side of this is that users should never be granted MODIFY access to production tables.
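Pulling together the recommendations above on catalog-level isolation and group ownership, a minimal sketch might look like this (all names are hypothetical):

    -- Catalog-level isolation: only group_b gets USAGE on catalog_b and its schemas
    GRANT USAGE ON CATALOG catalog_b TO `group_b`;
    GRANT USAGE ON SCHEMA catalog_b.reporting TO `group_b`;
    GRANT SELECT ON TABLE catalog_b.reporting.kpis TO `group_b`;

    -- Ownership of production containers is assigned to a group, not an individual
    ALTER CATALOG catalog_b OWNER TO `data_platform_admins`;
    ALTER SCHEMA catalog_b.reporting OWNER TO `data_platform_admins`;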
Our final recommendation concerns external storage. Consider the file system hierarchy of a single cloud storage container: storage credentials provide the same kind of access control as external locations, but applied to the entire container. It is normally desirable to predicate control on individual paths, which can only be accomplished using external locations.

Now that we have covered the concepts of Unity Catalog, let's actually see how Unity Catalog works in a real environment. In this example, we have set up a Databricks classroom environment that is able to run Unity Catalog. I will be transitioning between one account, which will create and own the data as well as assign grants, and another user, who is part of the group that will be given access. This second user is logged into the workspace but uses the Databricks SQL editor in the browser to run queries. You'll know which one I'm using by looking at the upper right corner, which shows either the Class 25 or the Class 00 designation. Let's first start as the administrator of this workspace.

In this environment, we already have a cluster running and attached to this notebook, and we've already run the setup scripts to establish the variables we'll need for this example. These variables are specific to our training environment, so you don't have to worry about them in a real environment. Just as a reminder, Unity Catalog utilizes a three-level namespace: the catalog, then the schema, and then the table name. Traditionally we only used a two-level namespace, but since Unity Catalog supports multiple catalogs, that extra layer is added here.

Let's go ahead and create a new catalog and then set it as the default catalog, so we don't have to keep referencing it throughout the commands we're going to run in this notebook. Next, we'll create a schema using CREATE SCHEMA IF NOT EXISTS. Then we'll set up a table and a view to play with in our environment. Let's first start with the table. We'll just use a simple CREATE OR REPLACE TABLE and then an INSERT INTO to populate values for the table to be read from. We will have a total of five rows of data to work with initially. Next, let's create a view. We'll do a CREATE OR REPLACE VIEW and select specific pieces of data from the heart rate device table we created, so that we have a view the end user can use to verify whether or not our permission settings make any difference. Notice that the view actually does an aggregation of the data: instead of five rows of data for the individual users, we have a total of three rows giving us average heart rates for the users we're looking at in this example.

The account users group has already been created for us. This is the group that all of the users in this classroom have been assigned to, so we're going to play with the ability to grant and revoke access using this particular group. As we mentioned earlier, best practice in Unity Catalog is to assign permissions at the group level rather than to individual users, which gives us an easier way to maintain consistent security policies as users transition in and out of roles. Let's go ahead and uncomment these lines and give permissions to our account users group. We're going to grant USAGE on the catalog, USAGE on the schema, and SELECT on the actual view we want them to use. Remember that SELECT on the object we want them to view is not enough; we also need to make sure they have USAGE permissions in order to traverse the three-level namespace and get to the object they want to work with.
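Here is an abbreviated sketch of the setup just described; the catalog, schema, table, and view names approximate the demo's, and the sample rows are illustrative rather than the actual classroom data:

    -- Create a catalog and make it the default for the rest of the notebook
    CREATE CATALOG IF NOT EXISTS demo_catalog;
    USE CATALOG demo_catalog;

    -- Create a schema to hold the demo objects
    CREATE SCHEMA IF NOT EXISTS example;
    USE example;

    -- A simple table populated with a handful of rows
    CREATE OR REPLACE TABLE heart_rate_device
      (device_id INT, mrn STRING, name STRING, time TIMESTAMP, heartrate DOUBLE);
    INSERT INTO heart_rate_device VALUES
      (23, '40580129', 'Nicholas Spears', '2020-02-01T00:01:58', 54.0),
      (17, '52804177', 'Lynn Russell',    '2020-02-01T00:02:55', 92.0);

    -- A view that aggregates the raw data
    CREATE OR REPLACE VIEW agg_heart_rate AS
      SELECT mrn, name, MEAN(heartrate) AS avg_heartrate, DATE_TRUNC('DD', time) AS date
      FROM heart_rate_device
      GROUP BY mrn, name, DATE_TRUNC('DD', time);

    -- Grants needed for the account users group to query the view
    GRANT USAGE ON CATALOG demo_catalog TO `account users`;
    GRANT USAGE ON SCHEMA example TO `account users`;
    GRANT SELECT ON VIEW agg_heart_rate TO `account users`;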
Now let's see what happens when this user tries to access the data. We're going to use a command to generate the SELECT statement we want the user to run; this way it uses the same variables that we configured for this notebook. Let's take that command and paste it into the SQL editor for the user we're working with. Notice that, to make sure we're using the exact view we want, we're using the three-level namespace of catalog, schema, and view name. Let's go ahead and run it, and we see that this user can now see the same view that we saw in the other window. We currently haven't set up anything other than the ability to select data from this view, so the user sees exactly the same thing the data owner did. Good: we've enabled access for the account users group, and now we're going to start seeing how to restrict that access.

Before we get into that, let's also prove that we can create and grant access to functions, in addition to tables and views, within our environment. We're going to do a CREATE OR REPLACE FUNCTION called mask. Mask takes a string input, returns a string, and concatenates characters so that everything except the last two characters of the input is masked out. Run this, then do a SELECT using it: if we mask the term 'sensitive data', we get back a string of asterisks followed by the last two characters of 'sensitive data'. To make sure the other user can use this, we grant EXECUTE on the function to the account users group. Let that run, and there we go. Then let's verify that this works: we generate a command to run in the other user's environment, open a new query window, and run it. We get the same output that we did before, so we've proven that users in the account users group can run the functions we've given them access to.
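A minimal sketch of the mask function and its grant, following the description above (the exact expression used in the demo may differ):

    -- Mask everything except the last two characters of the input
    CREATE OR REPLACE FUNCTION mask(x STRING)
      RETURNS STRING
      RETURN CONCAT(REPEAT('*', LENGTH(x) - 2), RIGHT(x, 2));

    SELECT mask('sensitive data');   -- returns ************ta

    -- Allow the account users group to call the function
    GRANT EXECUTE ON FUNCTION mask TO `account users`;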
Next, let's start protecting the columns and rows of our tables with dynamic views. Dynamic views can take advantage of some built-in functions that make our lives easier. First is current_user, which returns the current user's identity, so we can verify whether this user has access rights to the things we're looking at. Second is is_account_group_member, which checks whether the account I'm logged in as is part of a group that has permissions assigned to it; we'll see what this looks like as we go through here. There's also another function called is_member, but we don't recommend using it; it's essentially deprecated at this point, and we prefer that you use the more recent current_user and is_account_group_member. How do we use these? We're going to use them to redact columns.

Let's take a look. We're going to rewrite the aggregate heart rate view as a new version. In this case, we're going to use CASE statements to check for a specific condition: whether or not the user running the command is part of the account users group. If they are, we replace the data with the value REDACTED; otherwise, they get the actual mrn data. We do the exact same thing for the name column, redacting the name if the user is part of this restricted group. The rest is just standard SQL, so we won't worry too much about that. Let's see what happens. The view is now created, but when we replace a view, we also have to reissue the grants, so let's reassign the SELECT grant on the aggregate heart rate view to the account users group. Now let's look at the SQL editor for our regular user. I'm not going to change the query at all; it's the exact same simple SELECT on the aggregate heart rate view as before. At this point we've rewritten the view to include the redactions and reissued the grants; from the user's side, they don't necessarily know anything has happened, at least not until they run the command. When I run it, we see that the mrn and name come back as REDACTED, because the view they were granted has changed on the back end.

That's pretty exciting. What else can we do? We can still leverage is_account_group_member to filter rows as well. I can do a CREATE OR REPLACE VIEW on the aggregate heart rate view as SELECT mrn, time, device_id, and heartrate, all the same as before, but now add a WHERE clause: in the case where the user is part of the account users group, we only let them see device IDs less than 30. We make that change, reissue the grants, and switch back over to the SQL editor. Notice that we've changed the view, so instead of seeing the aggregated data, we're now seeing the raw data, but when I run it, what we see are device IDs, none greater than 30. We made that condition part of the view so that this user is restricted to only seeing rows with device IDs below 30.

We can also do what's referred to as data masking. Data masking is one of the reasons we created that function earlier, because we want to be able to run a masking function as part of these checks. Here we're going to rewrite the aggregate heart rate view one more time. We'll check whether the user running the view is part of the account users group, and if so, we'll run the mask function on the mrn value rather than redacting it entirely. You may need to do this in situations where you still need to run statistics on unique identifiers but can't expose the whole identifier. This is the same idea as using, for example, only the last four digits of a social security number: we need an identifier that identifies the person without giving away their complete identity. So we'll mask the mrn in this case, and we'll also combine this with the filter we set up based on group membership, to make sure we don't see any device IDs above 30. We make that change, reissue the grants, and go back to our SQL editor one more time. Again, without changing the actual SELECT statement, we just run it, and now we see that the mrn has been masked and we are still not seeing any device IDs greater than 30. So we can mix and match; we can be very creative in the way we restrict data, combining row restrictions and dynamic masking in a single view definition.
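Here is a sketch that combines the demo's variations into one dynamic view, masking the mrn column, redacting the name, and filtering rows for members of the account users group; column and object names follow the earlier sketches and are approximations:

    -- Mask/redact columns and filter rows for members of the account users group
    CREATE OR REPLACE VIEW agg_heart_rate AS
    SELECT
      CASE WHEN is_account_group_member('account users')
           THEN mask(mrn)              -- partially mask the identifier
           ELSE mrn END AS mrn,
      CASE WHEN is_account_group_member('account users')
           THEN 'REDACTED'
           ELSE name END AS name,
      time,
      device_id,
      heartrate
    FROM heart_rate_device
    WHERE
      CASE WHEN is_account_group_member('account users')
           THEN device_id < 30         -- row filter for restricted users
           ELSE TRUE END;

    -- Reissue the grant after replacing the view, as the demo does
    GRANT SELECT ON VIEW agg_heart_rate TO `account users`;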
Let's take a moment to make sure we understand where everything is. If I do a SHOW TABLES, we see that the aggregate heart rate and heart rate device objects both show up. The SHOW TABLES command is a little bit generic: it lists both tables and views, even though we specifically asked for tables. If I want to be more selective, I can say SHOW VIEWS, which shows me just the views. Notice that in this case I get the aggregate heart rate view, which reads from the heart rate device table, while the table itself does not show up. I can also run SHOW SCHEMAS, which shows me the schemas that are part of the catalog I'm currently using; we have three right now: default, example, and information_schema, where example is the one we've been using for this demonstration. And I can also list the catalogs I have access to see: if I run SHOW CATALOGS, we see the catalogs I'm currently allowed to see within my current workspace.

Now that I know what those objects are, where they are, and what names they have, I can do things like SHOW GRANTS. Showing the grants on the aggregate heart rate view tells me that account users currently have SELECT on this particular object, identified, again, by catalog name, schema, and object name. I can also take a look at the grants on the heart rate device table, but remember, we never actually put any grants on that table. We didn't have to give users direct access, and this is an important point about Unity Catalog: I can provide a view to a user without granting them anything on the table itself, so that they can never access the raw data. They can only see the data I give them access to through the view I've created and granted. And remember, even that is restricted: we have to make sure the schema containing the table or view we want them to access has the USAGE grant assigned to them so they can traverse that structure, and the same goes for the catalog. If USAGE isn't granted on the schema and the catalog, it won't matter what I grant at the object level, because access just won't be allowed.

If we want to revoke access, we can do that as well. First, let's show the grants on the mask function we created earlier. We see that the account users group has EXECUTE permissions on it. I can then revoke the EXECUTE permission from account users and verify that this actually took effect: where we saw the grant previously, there are now no grants on the mask function. I can even revoke USAGE on the catalog, using REVOKE USAGE to make that happen as well. And here's the interesting thing: let's go back over to the other user and run the query again. Notice that this user no longer has access to that catalog, so they can no longer get to any of the data contained within it.
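A condensed sketch of the inspection and revocation commands described here, using the object names from the earlier sketches:

    -- Inspect what exists and who can access it
    SHOW TABLES;
    SHOW VIEWS;
    SHOW SCHEMAS;
    SHOW CATALOGS;
    SHOW GRANTS ON VIEW agg_heart_rate;

    -- Walk back access when needed
    REVOKE EXECUTE ON FUNCTION mask FROM `account users`;
    REVOKE USAGE ON CATALOG demo_catalog FROM `account users`;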
This is a very good safety feature that helps us out here. If I'm really concerned, I can simply revoke access to the catalog and then, at my leisure, go through and adjust the grants on the functions, views, and tables later on. And with that, we've come to the end of our Unity Catalog demo. We have lab versions you can play around with as well, so if you have time, go ahead and try them.