Azure Data Lake Analytics
What is Azure Data Lake Analytics?
9/24/2018 • 2 minutes to read
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Instead of deploying,
configuring, and tuning hardware, you write queries to transform your data and extract valuable insights. The
analytics service can handle jobs of any scale instantly by setting the dial for how much power you need. You
only pay for your job when it is running, making it cost-effective.
Dynamic scaling
Data Lake Analytics dynamically provisions resources and lets you do analytics on terabytes to petabytes of data.
You pay only for the processing power used. As you increase or decrease the size of data stored or the amount of
compute resources used, you don’t have to rewrite code.
Develop faster, debug, and optimize smarter using familiar tools
Data Lake Analytics integrates deeply with Visual Studio. You can use familiar tools to run, debug, and tune your
code. Visualizations of your U-SQL jobs let you see how your code runs at scale, so you can easily identify
performance bottlenecks and optimize costs.
U-SQL: simple and familiar, powerful, and extensible
Data Lake Analytics includes U-SQL, a query language that extends the familiar, simple, declarative nature of
SQL with the expressive power of C#. The U-SQL language uses the same distributed runtime that powers
Microsoft's internal exabyte-scale data lake. SQL and .NET developers can now process and analyze their data
with the skills they already have.
Integrates seamlessly with your IT investments
Data Lake Analytics uses your existing IT investments for identity, management, and security. This approach
simplifies data governance and makes it easy to extend your current data applications. Data Lake Analytics is
integrated with Active Directory for user management and permissions and comes with built-in monitoring and
auditing.
Affordable and cost effective
Data Lake Analytics is a cost-effective solution for running big data workloads. You pay on a per-job basis when
data is processed. No hardware, licenses, or service-specific support agreements are required. The system
automatically scales up or down as the job starts and completes, so you never pay for more than what you need.
Learn more about controlling costs and saving money.
Works with all your Azure data
Data Lake Analytics works with Azure Data Lake Store for the highest performance, throughput, and
parallelization, and works with Azure Storage blobs, Azure SQL Database, and Azure SQL Data Warehouse.
Next steps
Get Started with Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Manage Azure Data Lake Analytics using Azure portal | Azure PowerShell | CLI | Azure .NET SDK | Node.js
How to control costs and save money with Data Lake Analytics
Get started with Azure Data Lake Analytics using
the Azure portal
9/18/2018 • 2 minutes to read
This article describes how to use the Azure portal to create Azure Data Lake Analytics accounts, define jobs in
U-SQL, and submit jobs to the Data Lake Analytics service.
Prerequisites
Before you begin this tutorial, you must have an Azure subscription. See Get Azure free trial.
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
See also
To get started developing U-SQL applications, see Develop U-SQL scripts using Data Lake Tools for Visual
Studio.
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Develop U-SQL scripts by using Data Lake Tools for
Visual Studio
3/15/2019 • 3 minutes to read
Azure Data Lake and Stream Analytics Tools include functionality related to two Azure services, Azure Data Lake
Analytics and Azure Stream Analytics. For more information on the Azure Stream Analytics scenarios, see Azure
Stream Analytics tools for Visual Studio.
This article describes how to use Visual Studio to create Azure Data Lake Analytics accounts, define jobs in
U-SQL, and submit jobs to the Data Lake Analytics service. For more information about Data Lake Analytics, see
Azure Data Lake Analytics overview.
IMPORTANT
Microsoft recommends you upgrade to Azure Data Lake Tools for Visual Studio version 2.3.3000.4 or later. The previous
versions are no longer available for download and are now deprecated.
What do I need to do?
1. Check whether you are using a version of Azure Data Lake Tools for Visual Studio earlier than 2.3.3000.4.
2. If your version is earlier than 2.3.3000.4, update your Azure Data Lake Tools for Visual Studio by visiting
the download center:
For Visual Studio 2017
For Visual Studio 2013 and 2015
Prerequisites
Visual Studio: All editions except Express are supported.
Visual Studio 2017
Visual Studio 2015
Visual Studio 2013
Microsoft Azure SDK for .NET version 2.7.1 or later. Install it by using the Web platform installer.
A Data Lake Analytics account. To create an account, see Get Started with Azure Data Lake Analytics using
Azure portal.
Next steps
Run U-SQL scripts on your own workstation for testing and debugging
Debug C# code in U-SQL jobs using Azure Data Lake Tools for Visual Studio Code
Use the Azure Data Lake Tools for Visual Studio Code
Use Azure Data Lake Tools for Visual Studio Code
3/6/2019 • 14 minutes to read
In this article, learn how you can use Azure Data Lake Tools for Visual Studio Code (VS Code) to create, test, and
run U-SQL scripts. The information is also covered in the following video:
Prerequisites
Azure Data Lake Tools for VS Code supports Windows, Linux, and macOS. U-SQL local run and local debug work
only in Windows.
Visual Studio Code
For macOS and Linux:
.NET Core SDK 2.0
Mono 5.2.x
@departments =
SELECT * FROM
(VALUES
(31, "Sales"),
(33, "Engineering"),
(34, "Clerical"),
(35, "Marketing")
) AS
D( DepID, DepName );
OUTPUT @departments
TO "/Output/departments.csv"
USING Outputters.Csv();
The script creates a departments.csv file with some data included in the /Output folder.
5. Save the file as myUSQL.usql in the opened folder.
To compile a U-SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Compile Script. The compile results appear in the Output window. You can also right-click a script
file, and then select ADL: Compile Script to compile a U-SQL job. The compilation result appears in the
Output pane.
To submit a U-SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Submit Job. You can also right-click a script file, and then select ADL: Submit Job.
After you submit a U-SQL job, the submission logs appear in the Output window in VS Code. The job view
appears in the right pane. If the submission is successful, the job URL appears too. You can open the job URL in a
web browser to track the real-time job status.
On the job view's SUMMARY tab, you can see the job details. Main functions include resubmitting a script, duplicating a
script, and opening the job in the portal. On the job view's DATA tab, you can refer to the input files, output files, and resource
files. Files can be downloaded to the local computer.
NOTE
Azure Data Lake Tools autodetects whether the DLL has any assembly dependencies. The dependencies are displayed in
the JSON file after they're detected.
You can upload your DLL resources (for example, .txt, .png, and .csv) as part of the assembly registration.
Another way to trigger the ADL: Register Assembly (Advanced) command is to right-click the .dll file in File
Explorer.
The following U-SQL code demonstrates how to call an assembly. In the sample, the assembly name is test.
REFERENCE ASSEMBLY [test];
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"Sample/SearchLog.txt"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"Sample/SearchLogtest.txt"
USING Outputters.Tsv();
Use U-SQL local run and local debug for Windows users
U-SQL local run tests your local data and validates your script locally before your code is published to Data Lake
Analytics. You can use the local debug feature to complete the following tasks before your code is submitted to
Data Lake Analytics:
Debug your C# code-behind.
Step through the code.
Validate your script locally.
The local run and local debug feature only works in Windows environments, and is not supported on macOS and
Linux-based operating systems.
For instructions on local run and local debug, see U-SQL local run and local debug with Visual Studio Code.
Connect to Azure
Before you can compile and run U-SQL scripts in Data Lake Analytics, you must connect to your Azure account.
To connect to Azure by using a command
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Login. The login information appears on the lower right.
3. Select Copy & Open to open the login webpage. Paste the code into the box, and then select Continue.
4. Follow the instructions to sign in from the webpage. When you're connected, your Azure account name
appears on the status bar in the lower-left corner of the VS Code window.
NOTE
Data Lake Tools automatically signs you in the next time if you don't sign out.
If your account has two-factor authentication enabled, we recommend that you use phone authentication rather than a PIN.
You can't sign out from the explorer. To sign out, see To connect to Azure by using a command.
A more convenient way to list the relative path is through the shortcut menu.
To list the storage path through the shortcut menu
Right-click the path string and select List Path.
Another way to preview the file is through the shortcut menu on the file's full path or the file's relative path in the
script editor.
Upload a file or folder
1. Right-click the script editor and select Upload File or Upload Folder.
2. Choose one file or multiple files if you selected Upload File, or choose the whole folder if you selected Upload
Folder. Then select Upload.
3. Choose the storage folder in the list, or select Enter a path or Browse from root path. (We're using Enter a
path as an example.)
4. Select your Data Lake Analytics account.
5. Browse to or enter the storage folder path (for example, /output/).
6. Select Choose Current Folder to specify your upload destination.
Another way to upload files to storage is through the shortcut menu on the file's full path or the file's relative path
in the script editor.
You can monitor the upload status.
Download a file
You can download a file by using the command ADL: Download File or ADL: Download File (Advanced).
To download a file through the ADL: Download File (Advanced) command
1. Right-click the script editor, and then select Download File (Advanced).
2. VS Code displays a JSON file. You can enter file paths and download multiple files at the same time.
Instructions are displayed in the Output window. To proceed to download the file or files, save (Ctrl+S) the
JSON file.
You can right-click the folder node and then use the Refresh and Upload Blob commands on the shortcut
menu.
You can right-click the file node and then use the Preview/Edit, Download, Delete, Create EXTRACT
Script (available only for CSV, TSV, and TXT files), Copy Relative Path, and Copy Full Path commands on
the shortcut menu.
Additional features
Data Lake Tools for VS Code supports the following features:
IntelliSense autocomplete: Suggestions appear in pop-up windows around items like keywords, methods,
and variables. Different icons represent different types of objects:
Scalar data type
Complex data type
Built-in UDTs
.NET collection and classes
C# expressions
Built-in C# UDFs, UDOs, and UDAGGs
U-SQL functions
U-SQL windowing functions
IntelliSense autocomplete on Data Lake Analytics metadata: Data Lake Tools downloads the Data
Lake Analytics metadata information locally. The IntelliSense feature automatically populates objects from
the Data Lake Analytics metadata. These objects include the database, schema, table, view, table-valued
function, procedures, and C# assemblies.
IntelliSense error marker: Data Lake Tools underlines editing errors for U-SQL and C#.
Syntax highlights: Data Lake Tools uses colors to differentiate items like variables, keywords, data types,
and functions.
NOTE
We recommend that you upgrade to Azure Data Lake Tools for Visual Studio version 2.3.3000.4 or later. The previous
versions are no longer available for download and are now deprecated.
Next steps
Develop U-SQL with Python, R, and C Sharp for Azure Data Lake Analytics in VS Code
U-SQL local run and local debug with Visual Studio Code
Tutorial: Get started with Azure Data Lake Analytics
Tutorial: Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Get started with Azure Data Lake Analytics using
Azure PowerShell
4/1/2019 • 2 minutes to read
Learn how to use Azure PowerShell to create Azure Data Lake Analytics accounts and then submit and run U-SQL
jobs. For more information about Data Lake Analytics, see Azure Data Lake Analytics overview.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
Before you begin this tutorial, you must have the following information:
An Azure Data Lake Analytics account. See Get started with Data Lake Analytics.
A workstation with Azure PowerShell. See How to install and configure Azure PowerShell.
Log in to Azure
This tutorial assumes you are already familiar with using Azure PowerShell. In particular, you need to know how
to log in to Azure. See Get started with Azure PowerShell if you need help.
To log in with a subscription name:
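For example, with the Az module (the subscription name is a placeholder):
Connect-AzAccount -Subscription "ContosoSubscription"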
Instead of the subscription name, you can also use a subscription ID to log in:
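Again assuming the Az module (the ID is a placeholder):
Connect-AzAccount -Subscription "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"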
If successful, the output of this command looks like the following text:
Environment : AzureCloud
Account : joe@contoso.com
TenantId : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
SubscriptionId : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
SubscriptionName : ContosoSubscription
CurrentStorageAccount :
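The snippets that follow assume the name of your Data Lake Analytics account is stored in a variable, for example:
$adla = "<DataLakeAnalyticsAccountName>"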
$script = @"
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"@
Submit the script text with the Submit-AdlJob cmdlet and the -Script parameter.
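For example, using the variables defined earlier (the job name is a placeholder):
$job = Submit-AdlJob -Account $adla -Name "My Job" -Script $script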
As an alternative, you can submit a script file using the -ScriptPath parameter:
$filename = "d:\test.usql"
$script | out-File $filename
$job = Submit-AdlJob -Account $adla -Name "My Job" -ScriptPath $filename
Instead of calling Get-AdlJob over and over until a job finishes, use the Wait-AdlJob cmdlet.
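For example, assuming the job object returned by Submit-AdlJob exposes its ID as JobId:
Wait-AdlJob -Account $adla -JobId $job.JobId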
See also
To see the same tutorial using other tools, click the tab selectors on the top of the page.
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Get started with Azure Data Lake Analytics using
Azure CLI
3/5/2019 • 4 minutes to read
This article describes how to use the Azure CLI to create Azure Data Lake Analytics
accounts, submit U-SQL jobs, and manage catalogs. The job reads a tab-separated values (TSV) file and converts it into a
comma-separated values (CSV) file.
Prerequisites
Before you begin, you need the following items:
An Azure subscription. See Get Azure free trial.
This article requires that you are running the Azure CLI version 2.0 or later. If you need to install or upgrade,
see Install Azure CLI.
Log in to Azure
To log in to your Azure subscription:
az login
You are prompted to browse to a URL and enter an authentication code. Then follow the instructions to enter
your credentials.
Once you have logged in, the login command lists your subscriptions.
To use a specific subscription:
az group list
az dls account create --account "<Data Lake Store Account Name>" --resource-group "<Resource Group Name>"
az dla account create --account "<Data Lake Analytics Account Name>" --resource-group "<Resource Group Name>"
--location "<Azure location>" --default-data-lake-store "<Default Data Lake Store Account Name>"
After creating an account, you can use the following commands to list the accounts and show account details:
az dls fs upload --account "<Data Lake Store Account Name>" --source-path "<Source File Path>" --destination-
path "<Destination File Path>"
az dls fs list --account "<Data Lake Store Account Name>" --path "<Path>"
Data Lake Analytics can also access Azure Blob storage. For uploading data to Azure Blob storage, see Using the
Azure CLI with Azure Storage.
This U-SQL script reads the source data file using Extractors.Tsv(), and then creates a CSV file using
Outputters.Csv().
Don't modify the two paths unless you copy the source file into a different location. Data Lake Analytics creates
the output folder if it doesn't exist.
It is simpler to use relative paths for files stored in default Data Lake Store accounts. You can also use absolute
paths. For example:
adl://<Data LakeStorageAccountName>.azuredatalakestore.net:443/Samples/Data/SearchLog.tsv
You must use absolute paths to access files in linked Storage accounts. The syntax for files stored in a linked Azure
Storage account is:
wasb://<BlobContainerName>@<StorageAccountName>.blob.core.windows.net/Samples/Data/SearchLog.tsv
NOTE
Azure Blob containers with public blob access are not supported.
Azure Blob containers with public container access are not supported.
To submit jobs
Use the following syntax to submit a job.
az dla job submit --account "<Data Lake Analytics Account Name>" --job-name "<Job Name>" --script "<Script
Path and Name>"
After you submit a job, you can use the following commands to list jobs and show the details of a specific job:
az dla job list --account "<Data Lake Analytics Account Name>"
az dla job show --account "<Data Lake Analytics Account Name>" --job-identity "<Job Id>"
To cancel jobs
az dla job cancel --account "<Data Lake Analytics Account Name>" --job-identity "<Job Id>"
Retrieve job results
After a job is completed, you can use the following commands to list the output files, and download the files:
az dls fs list --account "<Data Lake Store Account Name>" --path "/Output"
az dls fs preview --account "<Data Lake Store Account Name>" --path "/Output/SearchLog-from-Data-Lake.csv"
az dls fs preview --account "<Data Lake Store Account Name>" --path "/Output/SearchLog-from-Data-Lake.csv" --
length 128 --offset 0
az dls fs download --account "<Data Lake Store Account Name>" --source-path "/Output/SearchLog-from-Data-
Lake.csv" --destination-path "<Destination Path and File Name>"
Next steps
To see the Data Lake Analytics Azure CLI reference document, see Data Lake Analytics.
To see the Data Lake Store Azure CLI reference document, see Data Lake Store.
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
Manage Azure Data Lake Analytics using the Azure
portal
12/1/2018 • 5 minutes to read
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
the Azure portal.
NOTE
If a user or a security group needs to submit jobs, they also need permission on the store account. For more information,
see Secure data stored in Data Lake Store.
Manage jobs
Submit a job
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click New Job. For each job, configure:
a. Job Name: The name of the job.
b. Priority: Lower numbers have higher priority. If two jobs are queued, the one with lower priority value
runs first.
c. Parallelism: The maximum number of compute processes to reserve for this job.
3. Click Submit Job.
Monitor jobs
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click View All Jobs. A list of all the active and recently finished jobs in the account is shown.
3. Optionally, click Filter to help you find the jobs by Time Range, Job Name, and Author values.
Monitoring pipeline jobs
Jobs that are part of a pipeline work together, usually sequentially, to accomplish a specific scenario. For
example, you can have a pipeline that cleans, extracts, transforms, and aggregates usage data for customer insights.
Pipeline jobs are identified using the "Pipeline" property when the job is submitted. Jobs scheduled using
ADF V2 will automatically have this property populated.
To view a list of U-SQL jobs that are part of pipelines:
1. In the Azure portal, go to your Data Lake Analytics accounts.
2. Click Job Insights. The All Jobs tab is shown by default, with a list of running, queued, and ended jobs.
3. Click the Pipeline Jobs tab. A list of pipeline jobs will be shown along with aggregated statistics for each
pipeline.
Monitoring recurring jobs
A recurring job is one that has the same business logic but uses different input data every time it runs. Ideally,
recurring jobs should always succeed, and have relatively stable execution time; monitoring these behaviors will
help ensure the job is healthy. Recurring jobs are identified using the "Recurrence" property. Jobs scheduled
using ADF V2 will automatically have this property populated.
To view a list of U-SQL jobs that are recurring:
1. In the Azure portal, go to your Data Lake Analytics accounts.
2. Click Job Insights. The All Jobs tab is shown by default, with a list of running, queued, and ended jobs.
3. Click the Recurring Jobs tab. A list of recurring jobs will be shown along with aggregated statistics for each
recurring job.
Next steps
Overview of Azure Data Lake Analytics
Manage Azure Data Lake Analytics by using Azure PowerShell
Manage Azure Data Lake Analytics using policies
Manage Azure Data Lake Analytics using the Azure
Command-line Interface (CLI)
3/15/2019 • 4 minutes to read
Learn how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using the Azure CLI. To
see management topics using other tools, click the tab selector above.
Prerequisites
Before you begin this tutorial, you must have the following resources:
An Azure subscription. See Get Azure free trial.
Azure CLI. See Install and configure Azure CLI.
Download and install the pre-release Azure CLI tools in order to complete this demo.
Authenticate by using the az login command and select the subscription that you want to use. For more
information on authenticating using a work or school account, see Connect to an Azure subscription from
the Azure CLI.
az login
az account set --subscription <subscription id>
You can now access the Data Lake Analytics and Data Lake Store commands. Run the following command
to list the Data Lake Store and Data Lake Analytics commands:
az dls -h
az dla -h
Manage accounts
Before running any Data Lake Analytics jobs, you must have a Data Lake Analytics account. Unlike Azure
HDInsight, you don't pay for an Analytics account when it is not running a job. You only pay for the time when it
is running a job. For more information, see Azure Data Lake Analytics Overview.
Create accounts
Run the following command to create a Data Lake Analytics account:
az dla account create --account "<Data Lake Analytics account name>" --location "<Location Name>" --resource-
group "<Resource Group Name>" --default-data-lake-store "<Data Lake Store account name>"
Update accounts
The following command updates the properties of an existing Data Lake Analytics account:
az dla account update --account "<Data Lake Analytics Account Name>" --firewall-state "Enabled" --query-
store-retention 7
List accounts
List Data Lake Analytics accounts within a specific resource group
Delete an account
az dla account delete --account "<Data Lake Analytics account name>" --resource-group "<Resource group name>"
az dla account blob-storage add --access-key "<Azure Storage Account Key>" --account "<Data Lake Analytics
account name>" --storage-account-name "<Storage account name>"
NOTE
Only Blob storage short names are supported. Don't use FQDN, for example "myblob.blob.core.windows.net".
az dla account data-lake-store add --account "<Data Lake Analytics account name>" --data-lake-store-account-
name "<Data Lake Store account name>"
az dla account data-lake-store list --account "<Data Lake Analytics account name>"
az dla account blob-storage list --account "<Data Lake Analytics account name>"
az dla account data-lake-store delete --account "<Data Lake Analytics account name>" --data-lake-store-
account-name "<Azure Data Lake Store account name>"
az dla account blob-storage delete --account "<Data Lake Analytics account name>" --storage-account-name "
<Data Lake Store account name>"
Manage jobs
You must have a Data Lake Analytics account before you can create a job. For more information, see Manage
Data Lake Analytics accounts.
List jobs
az dla job show --account "<Data Lake Analytics account name>" --job-identity "<Job Id>"
Submit jobs
NOTE
The default priority of a job is 1000, and the default degree of parallelism for a job is 1.
az dla job submit --account "<Data Lake Analytics account name>" --job-name "<Name of your job>" --script
"<Script to submit>"
Cancel jobs
Use the list command to find the job id, and then use cancel to cancel the job.
az dla job cancel --account "<Data Lake Analytics account name>" --job-identity "<Job Id>"
az dla job pipeline list --account "<Data Lake Analytics Account Name>"
az dla job pipeline show --account "<Data Lake Analytics Account Name>" --pipeline-identity "<Pipeline ID>"
Use the az dla job recurrence commands to see the recurrence information for previously submitted jobs.
az dla job recurrence list --account "<Data Lake Analytics Account Name>"
az dla job recurrence show --account "<Data Lake Analytics Account Name>" --recurrence-identity "<Recurrence
ID>"
See also
Overview of Microsoft Azure Data Lake Analytics
Get started with Data Lake Analytics using Azure portal
Manage Azure Data Lake Analytics using Azure portal
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Manage Azure Data Lake Analytics using Azure
PowerShell
2/13/2019 • 8 minutes to read
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
Azure PowerShell.
Prerequisites
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which
will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install
Azure PowerShell.
To use PowerShell with Data Lake Analytics, collect the following pieces of information:
Subscription ID: The ID of the Azure subscription that contains your Data Lake Analytics account.
Resource group: The name of the Azure resource group that contains your Data Lake Analytics account.
Data Lake Analytics account name: The name of your Data Lake Analytics account.
Default Data Lake Store account name: Each Data Lake Analytics account has a default Data Lake Store
account.
Location: The location of your Data Lake Analytics account, such as "East US 2" or other supported
locations.
The PowerShell snippets in this tutorial use these variables to store this information:
$subId = "<SubscriptionId>"
$rg = "<ResourceGroupName>"
$adla = "<DataLakeAnalyticsAccountName>"
$adls = "<DataLakeStoreAccountName>"
$location = "<Location>"
Log in to Azure
Log in using interactive user authentication
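A minimal example with the Az module:
Connect-AzAccount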
Log in using a subscription ID or by subscription name
# Using subscription id
Connect-AzAccount -SubscriptionId $subId
$tenantid = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
$spi_appname = "appname"
$spi_appid = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
$spi_secret = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
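A sketch of a service principal login that uses these variables (assuming the Az module; the secret handling shown is illustrative):
$pscredential = New-Object System.Management.Automation.PSCredential(
    $spi_appid,
    (ConvertTo-SecureString $spi_secret -AsPlainText -Force))
Connect-AzAccount -ServicePrincipal -TenantId $tenantid -Credential $pscredential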
Manage accounts
List accounts
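A minimal sketch using the Az.DataLakeAnalytics cmdlets:
# List all Data Lake Analytics accounts in the current subscription
Get-AdlAnalyticsAccount
# List the accounts in a specific resource group
Get-AdlAnalyticsAccount -ResourceGroupName $rg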
Create an account
Every Data Lake Analytics account requires a default Data Lake Store account that it uses for storing logs. You
can reuse an existing account or create an account.
# Create a data lake store if needed, or you can re-use an existing one
New-AdlStore -ResourceGroupName $rg -Name $adls -Location $location
New-AdlAnalyticsAccount -ResourceGroupName $rg -Name $adla -Location $location -DefaultDataLakeStore $adls
You can find the default Data Lake Store account by filtering the list of datasources by the IsDefault property:
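For example (assuming Get-AdlAnalyticsDataSource returns an IsDefault property, as described above):
Get-AdlAnalyticsDataSource -Account $adla | Where-Object { $_.IsDefault -eq $true }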
$script = @"
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"@
$scriptpath = "d:\test.usql"
$script | Out-File $scriptpath
List jobs
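For example, using the Get-AdlJob cmdlet:
Get-AdlJob -Account $adla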
The output includes the currently running jobs and those jobs that have recently completed.
Cancel a job
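Cancellation is typically done with the Stop-AdlJob cmdlet (a sketch; the job ID is a placeholder):
Stop-AdlJob -Account $adla -JobId $jobID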
Manage files
Check for the existence of a file.
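A sketch using the Test-AdlStoreItem cmdlet (the path is a placeholder):
Test-AdlStoreItem -Account $adls -Path "/data.csv"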
Download a file.
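A sketch using the Export-AdlStoreItem cmdlet (paths are placeholders):
Export-AdlStoreItem -Account $adls -Path "/data.csv" -Destination "D:\data.csv"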
Resolve-AzError -Last
Find a TenantID
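The Get-TenantIdFrom* commands used below are helper functions rather than built-in Az cmdlets; a minimal sketch of one such helper, assuming Get-AzSubscription:
function Get-TenantIdFromSubscriptionName( [string] $subname )
{
    $sub = (Get-AzSubscription -SubscriptionName $subname)
    $sub.TenantId
}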
From a subscription name:
Get-TenantIdFromSubscriptionName "ADLTrainingMS"
$subid = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
Get-TenantIdFromSubscriptionId $subid
$domain = "contoso.com"
Get-TenantIdFromDomain $domain
$subs = Get-AzSubscription
foreach ($sub in $subs)
{
Write-Host $sub.Name "(" $sub.Id ")"
Write-Host "`tTenant Id" $sub.TenantId
}
Next steps
Overview of Microsoft Azure Data Lake Analytics
Get started with Data Lake Analytics using the Azure portal | Azure PowerShell | Azure CLI
Manage Azure Data Lake Analytics using Azure portal | Azure PowerShell | CLI
Manage Azure Data Lake Analytics using a .NET app
4/1/2019 • 7 minutes to read
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure .NET SDK.
Prerequisites
Visual Studio 2015, Visual Studio 2013 update 4, or Visual Studio 2012 with Visual C++ Installed.
Microsoft Azure SDK for .NET version 2.5 or above. Install it using the Web platform installer.
Required NuGet Packages
Install NuGet packages
PACKAGE VERSION
Microsoft.Rest.ClientRuntime.Azure.Authentication 2.3.1
Microsoft.Azure.Management.DataLake.Analytics 3.0.0
Microsoft.Azure.Management.DataLake.Store 2.2.0
Microsoft.Azure.Management.ResourceManager 1.6.0-preview
Microsoft.Azure.Graph.RBAC 3.4.0-preview
You can install these packages via the NuGet command line with the following commands:
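For example, from the NuGet Package Manager Console (versions as listed in the table above):
Install-Package -Id Microsoft.Rest.ClientRuntime.Azure.Authentication -Version 2.3.1
Install-Package -Id Microsoft.Azure.Management.DataLake.Analytics -Version 3.0.0
Install-Package -Id Microsoft.Azure.Management.DataLake.Store -Version 2.2.0
Install-Package -Id Microsoft.Azure.Management.ResourceManager -Version 1.6.0-preview
Install-Package -Id Microsoft.Azure.Graph.RBAC -Version 3.4.0-preview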
Common variables
string subid = "<Subscription ID>"; // Subscription ID (a GUID)
string tenantid = "<Tenant ID>"; // AAD tenant ID or domain. For example, "contoso.onmicrosoft.com"
string rg = "<value>"; // Resource group name
string clientid = "1950a258-227b-4e31-a9cf-717495945fc2"; // Sample client ID (this will work, but you should
pick your own)
Authentication
You have multiple options for logging on to Azure Data Lake Analytics. The following snippet shows an example
of authentication with interactive user authentication with a pop-up.
using System;
using System.IO;
using System.Threading;
using System.Security.Cryptography.X509Certificates;
using Microsoft.Rest;
using Microsoft.Rest.Azure.Authentication;
using Microsoft.Azure.Management.DataLake.Analytics;
using Microsoft.Azure.Management.DataLake.Analytics.Models;
using Microsoft.Azure.Management.DataLake.Store;
using Microsoft.Azure.Management.DataLake.Store.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Azure.Graph.RBAC;
The source code for GetCreds_User_Popup and the code for other options for authentication are covered in
Data Lake Analytics .NET authentication options.
Manage accounts
Create an Azure Resource Group
If you haven't already created one, you must have an Azure Resource Group to create your Data Lake Analytics
components. You need your authentication credentials, subscription ID, and a location. The following code shows
how to create a resource group:
For more information, see Azure Resource Groups and Data Lake Analytics.
Create a Data Lake Store account
Every ADLA account requires an ADLS account. If you don't already have one to use, you can create one with the
following code:
Delete an account
if (adlaClient.Account.Exists(rg, adla))
{
adlaClient.Account.Delete(rg, adla);
}
if (adlaClient.Account.Exists(rg, adla))
{
var adla_accnt = adlaClient.Account.Get(rg, adla);
string def_adls_account = adla_accnt.DefaultDataLakeStoreAccount;
}
if (stg_accounts != null)
{
foreach (var stg_account in stg_accounts)
{
Console.WriteLine($"Storage account: {stg_account.Name}");
}
}
if (adls_accounts != null)
{
foreach (var adls_accnt in adls_accounts)
{
Console.WriteLine($"ADLS account: {adls_accnt.Name}");
}
}
memstream.Position = 0;
List pipelines
The following code lists information about each pipeline of jobs submitted to the account.
var pipelines = adlaJobClient.Pipeline.List(adla);
foreach (var p in pipelines)
{
Console.WriteLine($"Pipeline: {p.Name}\t{p.PipelineId}\t{p.LastSubmitTime}");
}
List recurrences
The following code lists information about each recurrence of jobs submitted to the account.
Next steps
Overview of Microsoft Azure Data Lake Analytics
Manage Azure Data Lake Analytics using Azure portal
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Manage Azure Data Lake Analytics using Python
3/27/2019 • 4 minutes to read
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs by using
Python.
Authentication
Interactive user authentication with a pop-up
This method is not supported.
Interactive user authentication with a device code
user = input('Enter the user to authenticate with that has permission to subscription: ')
password = getpass.getpass()
credentials = UserPassCredentials(user, password)
adlsAcctResult = adlsAcctClient.account.create(
    rg,
    adls,
    DataLakeStoreAccount(
        location=location
    )
).wait()
adlaAcctResult = adlaAcctClient.account.create(
rg,
adla,
DataLakeAnalyticsAccount(
location=location,
default_data_lake_store_account=adls,
data_lake_store_accounts=[DataLakeStoreAccountInformation(name=adls)]
)
).wait()
Submit a job
script = """
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0),
("Woodgrove", 2700.0)
) AS
D( customer, amount );
OUTPUT @a
TO "/data.csv"
USING Outputters.Csv();
"""
jobId = str(uuid.uuid4())
jobResult = adlaJobClient.job.create(
adla,
jobId,
JobInformation(
name='Sample Job',
type='USql',
properties=USqlJobProperties(script=script)
)
)
pipelines = adlaJobClient.pipeline.list(adla)
for p in pipelines:
print('Pipeline: ' + p.name + ' ' + p.pipeline_id)
recurrences = adlaJobClient.recurrence.list(adla)
for r in recurrences:
print('Recurrence: ' + r.name + ' ' + r.recurrence_id)
userAadObjectId = "3b097601-4912-4d41-b9d2-78672fc2acde"
newPolicyParams = ComputePolicyCreateOrUpdateParameters(userAadObjectId, "User", 50, 250)
adlaAcctClient.compute_policies.create_or_update(rg, adla, "GaryMcDaniel", newPolicyParams)
Next steps
To see the same tutorial using other tools, click the tab selectors on the top of the page.
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
Manage Azure Data Lake Analytics using a Java app
11/7/2018 • 5 minutes to read
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure Java SDK.
Prerequisites
Java Development Kit (JDK) 8 (using Java version 1.8).
IntelliJ or another suitable Java development environment. The instructions in this document use IntelliJ.
Create an Azure Active Directory (AAD) application and retrieve its Client ID, Tenant ID, and Key. For more
information about AAD applications and instructions on how to get a client ID, see Create Active Directory
application and service principal using portal. The Reply URI and Key are available from the portal once you
have created the application and generated the key.
Go to File > Settings > Build > Execution > Deployment. Select Build Tools > Maven > Importing. Then
check Import Maven projects automatically.
Open Main.java and replace the existing code block with the following code snippet:
package com.company;
import com.microsoft.azure.CloudException;
import com.microsoft.azure.credentials.ApplicationTokenCredentials;
import com.microsoft.azure.management.datalake.store.*;
import com.microsoft.azure.management.datalake.store.models.*;
import com.microsoft.azure.management.datalake.analytics.*;
import com.microsoft.azure.management.datalake.analytics.models.*;
import com.microsoft.rest.credentials.ServiceClientCredentials;
import java.io.*;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.UUID;
import java.util.List;
_adlsAccountName = "<DATA-LAKE-STORE-NAME>";
_adlaAccountName = "<DATA-LAKE-ANALYTICS-NAME>";
_resourceGroupName = "<RESOURCE-GROUP-NAME>";
_location = "East US 2";
_tenantId = "<TENANT-ID>";
_subId = "<SUBSCRIPTION-ID>";
_clientId = "<CLIENT-ID>";
_clientSecret = "<CLIENT-SECRET>";
// ----------------------------------------
// Authenticate
// ----------------------------------------
ApplicationTokenCredentials creds = new ApplicationTokenCredentials(_clientId, _tenantId,
_clientSecret, null);
SetupClients(creds);
// ----------------------------------------
// List Data Lake Store and Analytics accounts that this app can access
// ----------------------------------------
System.out.println(String.format("All ADL Store accounts that this app can access in subscription
%s:", _subId));
List<DataLakeStoreAccount> adlsListResult = _adlsClient.getAccountOperations().list().getBody();
for (DataLakeStoreAccount acct : adlsListResult) {
System.out.println(acct.getName());
}
System.out.println(String.format("All ADL Analytics accounts that this app can access in subscription
%s:", _subId));
List<DataLakeAnalyticsAccount> adlaListResult = _adlaClient.getAccountOperations().list().getBody();
for (DataLakeAnalyticsAccount acct : adlaListResult) {
System.out.println(acct.getName());
}
WaitForNewline("Accounts displayed.", "Creating files.");
// ----------------------------------------
// Create a file in Data Lake Store: input1.csv
// ----------------------------------------
// ----------------------------------------
// Submit a job to Data Lake Analytics
// ----------------------------------------
String script = "@input = EXTRACT Data string FROM \"/input1.csv\" USING Extractors.Csv(); OUTPUT @input TO
@\"/output1.csv\" USING Outputters.Csv();";
UUID jobId = SubmitJobByScript(script);
WaitForNewline("Job submitted.", "Getting job status.");
// ----------------------------------------
// Wait for job completion and output job status
// ----------------------------------------
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
System.out.println("Waiting for job completion.");
WaitForJob(jobId);
System.out.println(String.format("Job status: %s", GetJobStatus(jobId)));
WaitForNewline("Job completed.", "Downloading job output.");
// ----------------------------------------
// Download job output from Data Lake Store
// ----------------------------------------
DownloadFile("/output1.csv", localFolderPath + "output1.csv");
WaitForNewline("Job output downloaded.", "Deleting file.");
}
}
Provide the values for parameters called out in the code snippet:
localFolderPath
_adlaAccountName
_adlsAccountName
_resourceGroupName
Helper functions
Setup clients
if (!nextAction.isEmpty())
{
System.out.println(nextAction);
}
}
Create accounts
Create a file
public static void CreateFile(String path, String contents, boolean force) throws IOException, CloudException
{
byte[] bytesContents = contents.getBytes();
Delete a file
public static void DeleteFile(String filePath) throws IOException, CloudException
{
_adlsFileSystemClient.getFileSystemOperations().delete(filePath, _adlsAccountName);
}
Download a file
public static void DownloadFile(String srcPath, String destPath) throws IOException, CloudException
{
InputStream stream = _adlsFileSystemClient.getFileSystemOperations().open(srcPath,
_adlsAccountName).getBody();
pWriter.println(fileContents);
pWriter.close();
}
return jobId;
}
Next steps
To learn U-SQL, see Get started with Azure Data Lake Analytics U-SQL language, and the U-SQL language
reference.
For management tasks, see Manage Azure Data Lake Analytics using Azure portal.
To get an overview of Data Lake Analytics, see Azure Data Lake Analytics overview.
Manage Azure Data Lake Analytics using Azure SDK
for Node.js
1/2/2019 • 2 minutes to read
This article describes how to manage Azure Data Lake Analytics accounts, data sources, users, and jobs using an
app written using the Azure SDK for Node.js.
The following versions are supported:
Node.js version: 0.10.0 or higher
REST API version for Account: 2015-10-01-preview
REST API version for Catalog: 2015-10-01-preview
REST API version for Job: 2016-03-20-preview
Features
Account management: create, get, list, update, and delete.
Job management: submit, get, list, and cancel.
Catalog management: get and list.
How to Install
npm install azure-arm-datalake-analytics
// A Data Lake Store account must already have been created to create
// a Data Lake Analytics account. See the Data Lake Store readme for
// information on doing so. For now, we assume one exists already.
var datalakeStoreAccountName = 'existingadlsaccount';
See also
Microsoft Azure SDK for Node.js
Microsoft Azure SDK for Node.js - Data Lake Store Management
Adding a user in the Azure portal
9/13/2018 • 2 minutes to read
Optionally, add the user to the Azure Data Lake Storage Gen1 Reader role.
1. Find your Azure Data Lake Storage Gen1 account.
2. Click on Users.
3. Click Add.
4. Select an Azure RBAC role to assign to this group.
5. Assign the Reader role. This role has the minimum set of permissions required to browse/manage data stored in
ADLSGen1. Assign this role if the group is not intended for managing Azure services.
6. Type in the name of the group.
7. Click OK.
Next steps
Overview of Azure Data Lake Analytics
Get started with Data Lake Analytics by using the Azure portal
Manage Azure Data Lake Analytics by using Azure PowerShell
Manage Azure Data Lake Analytics using policies
8/27/2018 • 4 minutes to read
Using account policies, you can control how the resources of an Azure Data Lake Analytics account are used. These
policies allow you to control the cost of using Azure Data Lake Analytics. For example, with these policies you can
prevent unexpected cost spikes by limiting how many AUs the account can simultaneously use.
Account-level policies
These policies apply to all jobs in a Data Lake Analytics account.
Maximum number of AUs in a Data Lake Analytics account
A policy controls the total number of Analytics Units (AUs) your Data Lake Analytics account can use. By default,
the value is set to 250. For example, if this value is set to 250 AUs, you can have one job running with 250 AUs
assigned to it, or 10 jobs running with 25 AUs each. Additional jobs that are submitted are queued until the
running jobs are finished. When running jobs are finished, AUs are freed up for the queued jobs to run.
To change the number of AUs for your Data Lake Analytics account:
1. In the Azure portal, go to your Data Lake Analytics account.
2. Click Properties.
3. Under Maximum AUs, move the slider to select a value, or enter the value in the text box.
4. Click Save.
NOTE
If you need more than the default (250) AUs, in the portal, click Help+Support to submit a support request. The number of
AUs available in your Data Lake Analytics account can be increased.
NOTE
If you need to run more than the default (20) number of jobs, in the portal, click Help+Support to submit a support
request. The number of jobs that can run simultaneously in your Data Lake Analytics account can be increased.
Job-level policies
With job-level policies, you can control the maximum AUs and the maximum priority that individual users (or
members of specific security groups) can set on jobs that they submit. This policy lets you control the costs
incurred by users. It also lets you control the effect that scheduled jobs might have on high-priority production jobs
that are running in the same Data Lake Analytics account.
Data Lake Analytics has two policies that you can set at the job level:
AU limit per job: Users can only submit jobs that have up to this number of AUs. By default, this limit is the
same as the maximum AU limit for the account.
Priority: Users can only submit jobs that have a priority lower than or equal to this value. A higher number
indicates a lower priority. By default, this limit is set to 1, which is the highest possible priority.
There is a default policy set on every account. The default policy applies to all users of the account. You can set
additional policies for specific users and groups.
NOTE
Account-level policies and job-level policies apply simultaneously.
Next steps
Overview of Azure Data Lake Analytics
Get started with Data Lake Analytics by using the Azure portal
Manage Azure Data Lake Analytics by using Azure PowerShell
Configure user access to job information in Azure
Data Lake Analytics
8/27/2018 • 2 minutes to read
In Azure Data Lake Analytics, you can use multiple user accounts or service principals to run jobs.
In order for those same users to see the detailed job information, the users need to be able to read the contents of
the job folders. The job folders are located in the /system/ directory.
If the necessary permissions are not configured, the user may see an error:
Graph data not available - You don't have permissions to access the graph data.
Next steps
Add a new user
Accessing diagnostic logs for Azure Data Lake
Analytics
2/27/2019 • 5 minutes to read
Diagnostic logging allows you to collect data access audit trails. These logs provide information such as:
A list of users that accessed the data.
How frequently the data is accessed.
How much data is stored in the account.
Enable logging
1. Sign in to the Azure portal.
2. Open your Data Lake Analytics account and select Diagnostic logs from the Monitor section. Next, select
Turn on diagnostics.
3. From Diagnostics settings, enter a Name for this logging configuration and then select logging options.
You can choose to store/process the data in three different ways.
Select Archive to a storage account to store logs in an Azure storage account. Use this
option if you want to archive the data. If you select this option, you must provide an Azure
storage account to save the logs to.
Select Stream to an Event Hub to stream log data to an Azure Event Hub. Use this option if
you have a downstream processing pipeline that is analyzing incoming logs in real time. If you
select this option, you must provide the details for the Azure Event Hub you want to use.
Select Send to Log Analytics to send the data to the Azure Monitor service. Use this option
if you want to use Azure Monitor logs to gather and analyze logs.
Specify whether you want to get audit logs or request logs or both. A request log captures every API
request. An audit log records all operations that are triggered by that API request.
For Archive to a storage account, specify the number of days to retain the data.
Click Save.
NOTE
You must select either Archive to a storage account, Stream to an Event Hub or Send to Log Analytics
before clicking the Save button.
resourceId=/
SUBSCRIPTIONS/
<<SUBSCRIPTION_ID>>/
RESOURCEGROUPS/
<<RESOURCE_GRP_NAME>>/
PROVIDERS/
MICROSOFT.DATALAKEANALYTICS/
ACCOUNTS/
<<DATA_LAKE_ANALYTICS_NAME>>/
y=####/
m=##/
d=##/
h=##/
m=00/
PT1H.json
NOTE
The ## entries in the path contain the year, month, day, and hour in which the log was created. Data Lake Analytics
creates one file every hour, so m= always contains a value of 00.
https://adllogs.blob.core.windows.net/insights-logs-requests/resourceId=/SUBSCRIPTIONS/<sub-
id>/RESOURCEGROUPS/myresourcegroup/PROVIDERS/MICROSOFT.DATALAKEANALYTICS/ACCOUNTS/mydatalakeanalytics/y
=2016/m=07/d=18/h=14/m=00/PT1H.json
Log structure
The audit and request logs are in a structured JSON format.
Request logs
Here's a sample entry in the JSON-formatted request log. Each blob has one root object called records that
contains an array of log objects.
{
"records":
[
. . . .
,
{
"time": "2016-07-07T21:02:53.456Z",
"resourceId":
"/SUBSCRIPTIONS/<subscription_id>/RESOURCEGROUPS/<resource_group_name>/PROVIDERS/MICROSOFT.DATALAKEANALYTICS/A
CCOUNTS/<data_lake_analytics_account_name>",
"category": "Requests",
"operationName": "GetAggregatedJobHistory",
"resultType": "200",
"callerIpAddress": "::ffff:1.1.1.1",
"correlationId": "4a11c709-05f5-417c-a98d-6e81b3e29c58",
"identity": "1808bd5f-62af-45f4-89d8-03c5e81bac30",
"properties": {
"HttpMethod":"POST",
"Path":"/JobAggregatedHistory",
"RequestContentLength":122,
"ClientRequestId":"3b7adbd9-3519-4f28-a61c-bd89506163b8",
"StartTime":"2016-07-07T21:02:52.472Z",
"EndTime":"2016-07-07T21:02:53.456Z"
}
}
,
. . . .
]
}
Audit logs
Here's a sample entry in the JSON-formatted audit log. Each blob has one root object called records that contains
an array of log objects.
{
"records":
[
. . . .
,
{
"time": "2016-07-28T19:15:16.245Z",
"resourceId":
"/SUBSCRIPTIONS/<subscription_id>/RESOURCEGROUPS/<resource_group_name>/PROVIDERS/MICROSOFT.DATALAKEANALYTICS/A
CCOUNTS/<data_lake_ANALYTICS_account_name>",
"category": "Audit",
"operationName": "JobSubmitted",
"identity": "user@somewhere.com",
"properties": {
"JobId":"D74B928F-5194-4E6C-971F-C27026C290E6",
"JobName": "New Job",
"JobRuntimeName": "default",
"SubmitTime": "7/28/2016 7:14:57 PM"
}
}
,
. . . .
]
}
NOTE
resultType and resultSignature provide information on the result of an operation, and only contain a value if an operation
has completed. For example, they only contain a value when operationName contains a value of JobStarted or JobEnded.
JobName String The name that was provided for the job
SubmitTime String The time (in UTC) that the job was
submitted
NOTE
SubmitTime, StartTime, EndTime, and Parallelism provide information on an operation. These entries only contain a
value if that operation has started or completed. For example, SubmitTime only contains a value after operationName has
the value JobSubmitted.
Next steps
Overview of Azure Data Lake Analytics
Adjust quotas and limits in Azure Data Lake Analytics
8/27/2018 • 2 minutes to read
Learn how to adjust and increase the quota and limits in Azure Data Lake Analytics (ADLA) accounts. Knowing
these limits may help you understand your U-SQL job behavior. All quota limits are soft, so you can increase the
maximum limits by contacting Azure support.
Next steps
Overview of Microsoft Azure Data Lake Analytics
Manage Azure Data Lake Analytics using Azure PowerShell
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Disaster recovery guidance for Azure Data Lake
Analytics
6/4/2019 • 2 minutes to read
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Instead of deploying,
configuring, and tuning hardware, you write queries to transform your data and extract valuable insights. The
analytics service can handle jobs of any scale instantly by setting the dial for how much power you need. You only
pay for your job when it is running, making it cost-effective. This article provides guidance on how to protect your
jobs from rare region-wide outages or accidental deletions.
NOTE
Since account names are globally unique, use a consistent naming scheme that indicates which account is secondary.
2. For unstructured data, reference Disaster recovery guidance for data in Azure Data Lake Storage Gen1
3. For structured data stored in ADLA tables and databases, create copies of the metadata artifacts such as
databases, tables, table-valued functions, and assemblies. You need to periodically resync these artifacts
when changes happen in production. For example, newly inserted data has to be replicated to the secondary
region by copying the data and inserting into the secondary table.
NOTE
These object names are scoped to the secondary account and are not globally unique, so they can have the same
names as in the primary production account.
During an outage, you need to update your scripts so the input paths point to the secondary endpoint. Then the
users submit their jobs to the ADLA account in the secondary region. The output of the job will then be written to
the ADLA and ADLS account in the secondary region.
Next steps
Disaster recovery guidance for data in Azure Data Lake Storage Gen1
Get started with U-SQL in Azure Data Lake
Analytics
4/9/2019 • 4 minutes to read
U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale.
Through the scalable, distributed-query capability of U-SQL, you can efficiently analyze data across relational
stores such as Azure SQL Database. With U-SQL, you can process unstructured data by applying schema on
read and inserting custom logic and UDFs. Additionally, U-SQL includes extensibility that gives you fine-grained
control over how to execute at scale.
Learning resources
The U-SQL Tutorial provides a guided walkthrough of most of the U-SQL language. This document is
recommended reading for all developers wanting to learn U-SQL.
For detailed information about the U-SQL language syntax, see the U-SQL Language Reference.
To understand the U-SQL design philosophy, see the Visual Studio blog post Introducing U-SQL – A
Language that makes Big Data Processing Easy.
Prerequisites
Before you go through the U-SQL samples in this document, read and complete Tutorial: Develop U-SQL
scripts using Data Lake Tools for Visual Studio. That tutorial explains the mechanics of using U-SQL with Azure
Data Lake Tools for Visual Studio.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
OUTPUT @searchlog
TO "/output/SearchLog-first-u-sql.csv"
USING Outputters.Csv();
This script doesn't have any transformation steps. It reads from the source file called SearchLog.tsv, schematizes
it, and writes the rowset back into a file called SearchLog-first-u-sql.csv.
Notice the question mark next to the data type in the Duration field. It means that the Duration field could be
null.
Key concepts
Rowset variables: Each query expression that produces a rowset can be assigned to a variable. U-SQL
follows the T-SQL variable naming pattern ( @searchlog , for example) in the script.
The EXTRACT keyword reads data from a file and defines the schema on read. Extractors.Tsv is a built-in
U-SQL extractor for tab-separated-value files. You can develop custom extractors.
The OUTPUT statement writes data from a rowset to a file. Outputters.Csv() is a built-in U-SQL outputter to create a
comma-separated-value file. You can develop custom outputters.
File paths
The EXTRACT and OUTPUT statements use file paths. File paths can be absolute or relative:
The following absolute file path refers to a file in a Data Lake Store named mystore:
adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv
The following file path starts with "/". It refers to a file in the default Data Lake Store account:
/output/SearchLog-first-u-sql.csv
You can also assign the paths to scalar variables and use the variables in the script:
DECLARE @in string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM @in
USING Extractors.Tsv();
OUTPUT @searchlog
TO @out
USING Outputters.Csv();
Transform rowsets
Use SELECT to transform rowsets:
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @rs1
TO "/output/SearchLog-transform-rowsets.csv"
USING Outputters.Csv();
The WHERE clause uses a C# Boolean expression. You can use the C# expression language to write your own expressions and functions. You can perform more complex filtering by combining expressions with logical conjunctions (ANDs) and disjunctions (ORs).
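For example, a filter that accepts either of two regions could use a disjunction; a minimal sketch that reuses the @searchlog rowset defined above (the second region value is illustrative):

@rs_or =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE Region == "en-gb" OR Region == "en-us";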
The following script uses the DateTime.Parse() method and a conjunction.
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");
OUTPUT @rs1
TO "/output/SearchLog-transform-datetime.csv"
USING Outputters.Csv();
NOTE
The second query operates on the result of the first rowset, so the two filters are combined. You can also reuse a variable name; the names are scoped lexically.
Aggregate rowsets
U-SQL gives you the familiar ORDER BY, GROUP BY, and aggregations.
The following query finds the total duration per region, and then displays the top five durations in order.
U-SQL rowsets do not preserve their order for the next query. Thus, to order an output, you need to add ORDER BY to the OUTPUT statement:
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region;
@res =
SELECT *
FROM @rs1
ORDER BY TotalDuration DESC
FETCH 5 ROWS;
OUTPUT @rs1
TO @out1
ORDER BY TotalDuration DESC
USING Outputters.Csv();
OUTPUT @res
TO @out2
ORDER BY TotalDuration DESC
USING Outputters.Csv();
The U-SQL ORDER BY clause requires using the FETCH clause in a SELECT expression.
The U-SQL HAVING clause can be used to restrict the output to groups that satisfy the HAVING condition:
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int?,
Urls string,
ClickedUrls string
FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region
HAVING SUM(Duration) > 200;
OUTPUT @res
TO "/output/Searchlog-having.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
For advanced aggregation scenarios, see the U-SQL reference documentation for aggregate, analytic, and reference functions.
Next steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts by using Data Lake Tools for Visual Studio
Get started with the U-SQL Catalog in Azure Data
Lake Analytics
1/18/2019 • 2 minutes to read • Edit Online
Create a TVF
In the previous U-SQL script, you repeated the use of EXTRACT to read from the same source file. With the U-SQL table-valued function (TVF), you can encapsulate the data for future reuse.
The following script creates a TVF called Searchlog() in the default database and schema.
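A minimal sketch of what such a creation script looks like, assuming the same SearchLog.tsv schema used earlier:

DROP FUNCTION IF EXISTS Searchlog;

CREATE FUNCTION Searchlog()
RETURNS @searchlog TABLE
(
    UserId int,
    Start DateTime,
    Region string,
    Query string,
    Duration int?,
    Urls string,
    ClickedUrls string
)
AS BEGIN
    @searchlog =
        EXTRACT UserId int,
                Start DateTime,
                Region string,
                Query string,
                Duration int?,
                Urls string,
                ClickedUrls string
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();
    RETURN;
END;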
The following script shows you how to use the TVF that was defined in the previous script:
@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM Searchlog() AS S
GROUP BY Region
HAVING SUM(Duration) > 200;
OUTPUT @res
TO "/output/SearchLog-use-tvf.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
Create views
If you have a single query expression, instead of a TVF you can use a U-SQL VIEW to encapsulate that expression.
The following script creates a view called SearchlogView in the default database and schema.
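A minimal sketch of what such a view definition looks like, assuming the same source file and schema:

DROP VIEW IF EXISTS SearchlogView;

CREATE VIEW SearchlogView AS
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int?,
            Urls string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();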
@res =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM SearchlogView
GROUP BY Region
HAVING SUM(Duration) > 200;
OUTPUT @res
TO "/output/Searchlog-use-view.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
Create tables
As with relational database tables, with U-SQL you can create a table with a predefined schema or create a table that infers the schema from the query that populates the table (also known as CREATE TABLE AS SELECT, or CTAS).
Create a database and two tables by using the following script:
DROP DATABASE IF EXISTS SearchLogDb;
CREATE DATABASE SearchLogDb;
USE DATABASE SearchLogDb;
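A minimal sketch of how the two tables might then be created in that database, one with a predefined schema and one with CTAS (the index and distribution choices are illustrative):

@searchlog =
    EXTRACT UserId int, Start DateTime, Region string, Query string,
            Duration int?, Urls string, ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Predefined schema, populated with INSERT
CREATE TABLE SearchLog1
(
    UserId int, Start DateTime, Region string, Query string,
    Duration int?, Urls string, ClickedUrls string,
    INDEX sl_idx CLUSTERED (UserId ASC) DISTRIBUTED BY HASH (UserId)
);
INSERT INTO SearchLog1 SELECT * FROM @searchlog;

// CTAS: the schema of SearchLog2 is inferred from the SELECT
CREATE TABLE SearchLog2
(
    INDEX sl_idx CLUSTERED (UserId ASC) DISTRIBUTED BY HASH (UserId)
) AS SELECT * FROM @searchlog;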
Query tables
You can query tables, such as those created in the previous script, in the same way that you query data files. Instead of creating a rowset by using EXTRACT, you can refer to the table name.
To read from the tables, modify the transform script that you used previously:
@rs1 =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM SearchLogDb.dbo.SearchLog2
GROUP BY Region;
@res =
SELECT *
FROM @rs1
ORDER BY TotalDuration DESC
FETCH 5 ROWS;
OUTPUT @res
TO "/output/Searchlog-query-table.csv"
ORDER BY TotalDuration DESC
USING Outputters.Csv();
NOTE
Currently, you cannot run a SELECT on a table in the same script as the one where you created the table.
Next Steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Monitor and troubleshoot Azure Data Lake Analytics jobs using Azure portal
Develop U-SQL user-defined operators (UDOs)
4/9/2019 • 2 minutes to read • Edit Online
This article describes how to develop user-defined operators to process data in a U-SQL job.
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace USQL_UDO
{
    public class CountryName : IProcessor
    {
        private static IDictionary<string, string> CountryTranslation = new Dictionary<string, string>
        {
            { "Deutschland", "Germany" },
            { "Suisse", "Switzerland" },
            { "UK", "United Kingdom" },
            { "USA", "United States of America" },
            { "中国", "PR China" }
        };

        // Process is called once per input row. The column names match the
        // PRODUCE clause of the calling U-SQL script shown below.
        public override IRow Process(IRow input, IUpdatableRow output)
        {
            string UserID = input.Get<string>("UserID");
            string Name = input.Get<string>("Name");
            string Address = input.Get<string>("Address");
            string City = input.Get<string>("City");
            string State = input.Get<string>("State");
            string PostalCode = input.Get<string>("PostalCode");
            string Country = input.Get<string>("Country");
            string Phone = input.Get<string>("Phone");

            // Replace known local country names with their English equivalents
            if (CountryTranslation.Keys.Contains(Country))
            {
                Country = CountryTranslation[Country];
            }
            output.Set<string>(0, UserID);
            output.Set<string>(1, Name);
            output.Set<string>(2, Address);
            output.Set<string>(3, City);
            output.Set<string>(4, State);
            output.Set<string>(5, PostalCode);
            output.Set<string>(6, Country);
            output.Set<string>(7, Phone);
            return output.AsReadOnly();
        }
    }
}
@drivers_CountryName =
PROCESS @drivers
PRODUCE UserID string,
Name string,
Address string,
City string,
State string,
PostalCode string,
Country string,
Phone string
USING new USQL_UDO.CountryName();
OUTPUT @drivers_CountryName
TO "/Samples/Outputs/Drivers.csv"
USING Outputters.Csv(Encoding.Unicode);
See also
Extending U-SQL Expressions with User-Code
Use Data Lake Tools for Visual Studio for developing U-SQL applications
Extend U-SQL scripts with Python code in Azure Data
Lake Analytics
8/27/2018 • 2 minutes to read • Edit Online
Prerequisites
Before you begin, ensure the Python extensions are installed in your Azure Data Lake Analytics account:
Navigate to your Data Lake Analytics account in the Azure portal.
In the left menu, under GETTING STARTED, click Sample Scripts.
Click Install U-SQL Extensions, and then click OK.
Overview
Python extensions for U-SQL enable developers to perform massively parallel execution of Python code. The following example illustrates the basic steps:
Use the REFERENCE ASSEMBLY statement to enable Python extensions for the U-SQL script.
Use the REDUCE operation to partition the input data on a key.
The Python extensions for U-SQL include a built-in reducer (Extension.Python.Reducer) that runs Python code on each vertex assigned to the reducer.
The U-SQL script contains the embedded Python code, which has a function called usqlml_main that accepts a pandas DataFrame as input and returns a pandas DataFrame as output.
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def get_mentions(tweet):
    # collect the @-mentions contained in a tweet
    return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

def usqlml_main(df):
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df
";
@t =
SELECT * FROM
(VALUES
("D1","T1","A1","@foo Hello World @bar"),
("D2","T2","A2","@baz Hello World @beer")
) AS
D( date, time, author, tweet );
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer(pyScript:@myScript);
OUTPUT @m
TO "/tweetmentions.csv"
USING Outputters.Csv();
The following Python packages are available to the embedded Python code:
pandas
numpy
numexpr
Exception Messages
Currently, an exception in Python code shows up as a generic vertex failure. In the future, U-SQL job error messages will display the Python exception message.
Input and Output size limitations
Every vertex has a limited amount of memory assigned to it. Currently, that limit is 6 GB for an AU. Because the input and output DataFrames must exist in memory in the Python code, the total size of the input and output cannot exceed 6 GB.
See also
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Use Azure Data Lake Tools for Visual Studio Code
Extend U-SQL scripts with R code in Azure Data Lake
Analytics
4/2/2019 • 4 minutes to read • Edit Online
The following example illustrates the basic steps for deploying R code:
Use the REFERENCE ASSEMBLY statement to enable R extensions for the U-SQL script.
Use the REDUCE operation to partition the input data on a key.
The R extensions for U-SQL include a built-in reducer (Extension.R.Reducer) that runs R code on each vertex assigned to the reducer.
Use the dedicated, named data frames inputFromUSQL and outputToUSQL to pass data between U-SQL and R. The input and output DataFrame identifier names are fixed; users cannot change these predefined names.
Keep the R code in a separate file and reference it in the U-SQL script
The following example illustrates a more complex usage. In this case, the R code is deployed as a RESOURCE that is referenced by the U-SQL script.
Save this R code as a separate file.
Save this R code as a separate file.
load("my_model_LM_Iris.rda")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval="confidence"))
Use a U-SQL script to deploy that R script with the DEPLOY RESOURCE statement.
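The DECLARE and DEPLOY RESOURCE statements that the following script assumes are not shown; a minimal sketch (the file paths, the R script name, and the partition count are illustrative):

DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
DEPLOY RESOURCE @"/usqlext/samples/R/PredictIris.R";

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/output/LMPredictionsIris.txt";
DECLARE @PartitionCount int = 10;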
REFERENCE ASSEMBLY [ExtR];
@InputData =
EXTRACT
SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT
Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
SepalLength,
SepalWidth,
PetalLength,
PetalWidth
FROM @InputData;
// Predict Species

The R runtime deployed by the extensions includes the following packages by default:
base
boot
Class
Cluster
codetools
compiler
datasets
doParallel
doRSR
foreach
foreign
Graphics
grDevices
grid
iterators
KernSmooth
lattice
MASS
Matrix
Methods
mgcv
nlme
Nnet
Parallel
pkgXMLBuilder
RevoIOQ
revoIpe
RevoMods
RevoPemaR
RevoRpeConnector
RevoRsrConnector
RevoScaleR
RevoTreeView
RevoUtils
RevoUtilsMath
Rpart
RUnit
spatial
splines
Stats
stats4
survival
Tcltk
Tools
translations
utils
XML
// R script to run
DECLARE @myRScript = @"
# install the magrittr package
install.packages('magrittr_1.5.zip', repos = NULL)
# load the magrittr package
require(magrittr)
# demonstrate use of the magrittr package
2 %>% sqrt
";
@InputData =
EXTRACT SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT 0 AS Par,
*
FROM @InputData;
Next Steps
Overview of Microsoft Azure Data Lake Analytics
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
Get started with the Cognitive capabilities of U-SQL
4/9/2019 • 2 minutes to read • Edit Online
Overview
Cognitive capabilities for U-SQL enable developers to put built-in intelligence into their big data programs.
The following samples using cognitive capabilities are available (a minimal sentiment-analysis sketch follows the list):
Imaging: Detect faces
Imaging: Detect emotion
Imaging: Detect objects (tagging)
Imaging: OCR (optical character recognition)
Text: Key Phrase Extraction & Sentiment Analysis
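A minimal sketch of the text-analysis pattern: the assembly and processor names below follow the installed U-SQL/Cognitive samples and may differ in your account, and the input path and columns are illustrative.

REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextSentiment];

@input =
    EXTRACT Text string
    FROM "/Samples/Data/Reviews.tsv"
    USING Extractors.Tsv();

@sentiment =
    PROCESS @input
    PRODUCE Text string,
            Sentiment string,
            Conf double
    READONLY Text
    USING new Cognition.Text.SentimentAnalyzer(true);

OUTPUT @sentiment
TO "/output/ReviewSentiment.tsv"
USING Outputters.Tsv();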
Next steps
U-SQL/Cognitive Samples
Develop U-SQL scripts using Data Lake Tools for Visual Studio
Using U-SQL window functions for Azure Data Lake Analytics jobs
U-SQL programmability guide
4/9/2019 • 41 minutes to read • Edit Online
U-SQL is a query language that's designed for big data workloads. One of the unique features of U-SQL is the combination of the SQL-like declarative language with the extensibility and programmability provided by C#. In this guide, we concentrate on the extensibility and programmability of the U-SQL language that's enabled by C#.
Requirements
Download and install Azure Data Lake Tools for Visual Studio.
@a =
SELECT * FROM
(VALUES
("Contoso", 1500.0, "2017-03-39"),
("Woodgrove", 2700.0, "2017-04-10")
) AS D( customer, amount, date );
@results =
SELECT
customer,
amount,
date
FROM @a;
This script defines two RowSets: @a and @results. RowSet @results is defined from @a.
@results =
SELECT
customer,
amount,
DateTime.Parse(date) AS date
FROM @a;
DECLARE @d = DateTime.Parse("2016/01/01");
@rs1 =
SELECT
Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt,
dt AS olddt
FROM @rs0;
OUTPUT @rs1
TO @output_file
USING Outputters.Text();
@rs1 =
SELECT
MAX(guid) AS start_id,
MIN(dt) AS start_time,
MIN(Convert.ToDateTime(Convert.ToDateTime(dt<@default_dt?@default_dt:dt).ToString("yyyy-MM-dd"))) AS
start_zero_time,
MIN(USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)) AS start_fiscalperiod,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
user,
des
FROM @rs0
GROUP BY user, des;
Consult the assembly registration instructions, which cover this topic in greater detail.
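As a quick illustration, registering a custom assembly in a catalog database and then referencing it from a script typically looks like the following; the database, assembly name, and path are illustrative:

USE DATABASE master;
DROP ASSEMBLY IF EXISTS [USQL_Programmability];
CREATE ASSEMBLY [USQL_Programmability] FROM @"/Assemblies/USQL_Programmability.dll";

Later scripts can then reference the registered assembly:

REFERENCE ASSEMBLY master.[USQL_Programmability];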
Use assembly versioning
Currently, U-SQL uses the .NET Framework version 4.5, so ensure that your own assemblies are compatible with that version of the runtime.
As mentioned earlier, U-SQL runs code in a 64-bit (x64) format, so make sure that your code is compiled to run on x64. Otherwise you get the incorrect format error shown earlier.
Each uploaded assembly DLL and resource file, such as a different runtime, a native assembly, or a config file, can be at most 400 MB. The total size of deployed resources, either via DEPLOY RESOURCE or via references to assemblies and their additional files, cannot exceed 3 GB.
Finally, note that each U-SQL database can only contain one version of any given assembly. For example, if you need both version 7 and version 8 of the Newtonsoft Json.NET library, you need to register them in two different databases. Furthermore, each script can only refer to one version of a given assembly DLL. In this respect, U-SQL follows the C# assembly management and versioning semantics.
int FiscalQuarter=0;
if (FiscalMonth >=1 && FiscalMonth<=3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace USQL_Programmability
{
public class CustomFunctions
{
public static string GetFiscalPeriod(DateTime dt)
{
int FiscalMonth=0;
if (dt.Month < 7)
{
FiscalMonth = dt.Month + 6;
}
else
{
FiscalMonth = dt.Month - 6;
}
int FiscalQuarter=0;
if (FiscalMonth >=1 && FiscalMonth<=3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}

// Return the fiscal period in the Q{quarter}:{month} format, for example "Q3:8"
return "Q" + FiscalQuarter.ToString() + ":" + FiscalMonth.ToString();
}
}
}
Now we are going to call this function from the base U-SQL script. To do this, we have to provide a fully qualified name for the function, including the namespace, which in this case is NameSpace.Class.Function(parameter):
USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)
@rs0 =
EXTRACT
guid Guid,
dt DateTime,
user String,
des String
FROM @input_file USING Extractors.Tsv();
@rs1 =
SELECT
MAX(guid) AS start_id,
MIN(dt) AS start_time,
MIN(Convert.ToDateTime(Convert.ToDateTime(dt<@default_dt?@default_dt:dt).ToString("yyyy-MM-dd"))) AS
start_zero_time,
MIN(USQL_Programmability.CustomFunctions.GetFiscalPeriod(dt)) AS start_fiscalperiod,
user,
des
FROM @rs0
GROUP BY user, des;
OUTPUT @rs1
TO @output_file
USING Outputters.Text();
0d8b9630-d5ca-11e5-8329-251efa3a2941,2016-02-11T07:04:17.2630000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User1",""
20843640-d771-11e5-b87b-8b7265c75a44,2016-02-11T07:04:17.2630000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User2",""
301f23d2-d690-11e5-9a98-4b4f60a1836f,2016-02-11T09:01:33.9720000-08:00,2016-06-
01T00:00:00.0000000,"Q3:8","User3",""
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace USQLApplication21
{
public class UserSession
{
static public string globalSession;
static public string StampUserSession(string eventTime, string PreviousRow, string Session)
{
if (!string.IsNullOrEmpty(PreviousRow))
{
double timeGap =
Convert.ToDateTime(eventTime).Subtract(Convert.ToDateTime(PreviousRow)).TotalMinutes;
if (timeGap <= 60) {return Session;}
else {return Guid.NewGuid().ToString();}
}
else {return Guid.NewGuid().ToString();}
}

static public string getStampUserSession(string Session)
{
// Reinitialize the global session ID each time the Session parameter changes
if (!string.IsNullOrEmpty(Session) && Session != globalSession) { globalSession = Session; }
return globalSession;
}
}
}
This example shows the global variable static public string globalSession; used inside the getStampUserSession function and getting reinitialized each time the Session parameter changes.
The U-SQL base script is as follows:
DECLARE @in string = @"\UserSession\test1.tsv";
DECLARE @out1 string = @"\UserSession\Out1.csv";
DECLARE @out2 string = @"\UserSession\Out2.csv";
DECLARE @out3 string = @"\UserSession\Out3.csv";
@records =
EXTRACT DataId string,
EventDateTime string,
UserName string,
UserSessionTimestamp string
FROM @in
USING Extractors.Tsv();
@rs1 =
SELECT
EventDateTime,
UserName,
LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC) AS prevDateTime,
string.IsNullOrEmpty(LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC)) AS Flag,
USQLApplication21.UserSession.StampUserSession
(
EventDateTime,
LAG(EventDateTime, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC),
LAG(UserSessionTimestamp, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC)
) AS UserSessionTimestamp
FROM @records;
@rs2 =
SELECT
EventDateTime,
UserName,
LAG(EventDateTime, 1)
OVER(PARTITION BY UserName ORDER BY EventDateTime ASC) AS prevDateTime,
string.IsNullOrEmpty( LAG(EventDateTime, 1) OVER(PARTITION BY UserName ORDER BY EventDateTime ASC)) AS
Flag,
USQLApplication21.UserSession.getStampUserSession(UserSessionTimestamp) AS UserSessionTimestamp
FROM @rs1
WHERE UserName != "UserName";
OUTPUT @rs2
TO @out2
ORDER BY UserName, EventDateTime ASC
USING Outputters.Csv();
This example demonstrates a more complicated use-case scenario in which we use a global variable inside a code-behind section that's applied to the entire memory rowset.
NOTE
U-SQL's built-in extractors and outputters currently cannot serialize or de-serialize UDT data to or from files, even with the IFormatter set. So when you're writing UDT data to a file with the OUTPUT statement, or reading it with an extractor, you have to pass it as a string or byte array. Then you call the serialization and deserialization code (that is, the UDT's ToString() method) explicitly. User-defined extractors and outputters, on the other hand, can read and write UDTs.
If we try to use a UDT in an EXTRACTOR or OUTPUTTER (outside of the previous SELECT), as shown here:
@rs1 =
SELECT
MyNameSpace.Myfunction_Returning_UDT(filed1) AS myfield
FROM @rs0;
OUTPUT @rs1
TO @output_file
USING Outputters.Text();
Description:
Resolution:
Implement a custom outputter that knows how to serialize this type, or call a serialization method on the type
in
the preceding SELECT. C:\Users\sergeypu\Documents\Visual Studio 2013\Projects\USQL-Programmability\
USQL-Programmability\Types.usql 52 1 USQL-Programmability
To work with a UDT in an outputter, we either have to serialize it to a string with the ToString() method or create a custom outputter.
UDTs currently cannot be used in GROUP BY. If a UDT is used in GROUP BY, the following error is thrown:
Description:
Resolution:
Add a SELECT statement where you can project a scalar column that you want to use with GROUP BY.
C:\Users\sergeypu\Documents\Visual Studio 2013\Projects\USQL-Programmability\USQL-Programmability\Types.usql
62 5 USQL-Programmability
using Microsoft.Analytics.Interfaces;
using System.IO;
Add Microsoft.Analytics.Interfaces, which is required for the UDT interfaces. In addition, System.IO might be needed to define the IFormatter interface.
Define a user-defined type with the SqlUserDefinedType attribute.
SqlUserDefinedType is used to mark a type definition in an assembly as a user-defined type (UDT) in U-SQL. The properties on the attribute reflect the physical characteristics of the UDT. This class cannot be inherited.
SqlUserDefinedType is a required attribute for UDT definition.
The constructor of the class:
SqlUserDefinedTypeAttribute(type formatter)
Type formatter: Required parameter to define a UDT formatter; specifically, the type of the IFormatter interface must be passed here.
[SqlUserDefinedType(typeof(MyTypeFormatter))]
public class MyType
{ … }
A typical UDT also requires a definition of the IFormatter interface, which has the following characteristics:
The IFormatter interface serializes and de-serializes an object graph with the root type of <typeparamref name="T">.
<typeparam name="T">The root type for the object graph to serialize and de-serialize.
Deserialize: De-serializes the data on the provided stream and reconstitutes the graph of objects.
Serialize: Serializes an object, or graph of objects, with the given root to the provided stream.
MyType instance: Instance of the type.
IColumnWriter writer / IColumnReader reader: The underlying column stream.
ISerializationContext context: Enum that defines a set of flags that specifies the source or destination context for
the stream during serialization.
Intermediate: Specifies that the source or destination context is not a persisted store.
Persistence: Specifies that the source or destination context is a persisted store.
As a regular C# type, a U-SQL UDT definition can include overrides for operators such as +/==/!=. It can also include static methods. For example, if we are going to use this UDT as a parameter to the U-SQL MIN aggregate function, we have to define the < operator override.
Earlier in this guide, we demonstrated an example of fiscal period identification from a specific date in the format Qn:Pn (Q1:P10). The following example shows how to define a custom type for fiscal period values.
Following is an example of a code-behind section with a custom UDT and the IFormatter interface:
Following is an example of a code-behind section with custom UDT and IFormatter interface:
[SqlUserDefinedType(typeof(FiscalPeriodFormatter))]
public struct FiscalPeriod
{
public int Quarter { get; private set; }
public int Month { get; private set; }
The defined type includes two numbers: quarter and month. Operators ==/!=/>/< and the static method ToString() are defined here.
As mentioned earlier, UDTs can be used in SELECT expressions, but cannot be used in an OUTPUTTER or EXTRACTOR without custom serialization. They either have to be serialized as a string with ToString() or used with a custom OUTPUTTER/EXTRACTOR.
Now let's discuss usage of the UDT. In a code-behind section, we changed our GetFiscalPeriod function to the following:
int FiscalQuarter = 0;
if (FiscalMonth >= 1 && FiscalMonth <= 3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}
@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file USING Extractors.Tsv();
@rs1 =
SELECT
guid AS start_id,
dt,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).Quarter AS fiscalquarter,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).Month AS fiscalmonth,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt) + new
USQL_Programmability.CustomFunctions.FiscalPeriod(1,7) AS fiscalperiod_adjusted,
user,
des
FROM @rs0;
@rs2 =
SELECT
start_id,
dt,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
fiscalquarter,
fiscalmonth,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).ToString() AS fiscalperiod,
// This user-defined type was created in the prior SELECT. Passing the UDT to this subsequent SELECT would
have failed if the UDT was not annotated with an IFormatter.
fiscalperiod_adjusted.ToString() AS fiscalperiod_adjusted,
user,
des
FROM @rs1;
OUTPUT @rs2
TO @output_file
USING Outputters.Text();
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace USQL_Programmability
{
public class CustomFunctions
{
static public DateTime? ToDateTime(string dt)
{
DateTime dtValue;
int FiscalQuarter = 0;
if (FiscalMonth >= 1 && FiscalMonth <= 3)
{
FiscalQuarter = 1;
}
if (FiscalMonth >= 4 && FiscalMonth <= 6)
{
FiscalQuarter = 2;
}
if (FiscalMonth >= 7 && FiscalMonth <= 9)
{
FiscalQuarter = 3;
}
if (FiscalMonth >= 10 && FiscalMonth <= 12)
{
FiscalQuarter = 4;
}
[SqlUserDefinedType(typeof(FiscalPeriodFormatter))]
public struct FiscalPeriod
{
public int Quarter { get; private set; }
[SqlUserDefinedAggregate]
public abstract class IAggregate<T1, T2, TResult> : IAggregate
{
protected IAggregate();
SqlUserDefinedAggregate indicates that the type should be registered as a user-defined aggregate. This class cannot be inherited.
The SqlUserDefinedAggregate attribute is optional for UDAGG definition.
The base class allows you to pass three abstract parameters: two as input parameters and one as the result. The data types are variable and should be defined during class inheritance.
Init is invoked once for each group during computation. It provides an initialization routine for each aggregation group.
Accumulate is executed once for each value. It provides the main functionality for the aggregation algorithm. It
can be used to aggregate values with various data types that are defined during class inheritance. It can accept
two parameters of variable data types.
Terminate is executed once per aggregation group at the end of processing to output the result for each group.
To declare correct input and output data types, define the class by inheriting from IAggregate with the appropriate type parameters (for example, IAggregate<string, string, string> for the GuidAggregate used below).
To invoke the aggregator from a U-SQL script, use the form:
AGG<UDAGG_functionname>(param1,param2)
@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file
USING Extractors.Tsv();
@rs1 =
SELECT
user,
AGG<USQL_Programmability.GuidAggregate>(guid,user) AS guid_list
FROM @rs0
GROUP BY user;
In this use-case scenario, we concatenate the GUIDs for each specific user.
NOTE
UDOs are limited to 0.5 GB of memory. This memory limitation does not apply to local executions.
[SqlUserDefinedExtractor]
public class SampleExtractor : IExtractor
{
public SampleExtractor(string row_delimiter, char col_delimiter)
{ … }
The SqlUserDefinedExtractor attribute indicates that the type should be registered as a user-defined extractor. This class cannot be inherited.
SqlUserDefinedExtractor is an optional attribute for UDE definition. It is used to define the AtomicFileProcessing property for the UDE object.
bool AtomicFileProcessing
true = Indicates that this extractor requires atomic input files (JSON, XML, ...)
false = Indicates that this extractor can deal with split / distributed files (CSV, SEQ, ...)
The main UDE programmability objects are input and output. The input object is used to enumerate input data as IUnstructuredReader. The output object is used to set output data as a result of the extractor activity.
output.Set<string>(count, part);
}
else
{
// keep the rest of the columns as-is
output.Set<string>(count, part);
}
count += 1;
}
}
yield return output.AsReadOnly();
}
yield break;
}
}
In this use-case scenario, the extractor regenerates the GUID for the "guid" column and converts the values of the "user" column to upper case. Custom extractors can produce more complicated results by parsing input data and manipulating it.
Following is the base U-SQL script that uses the custom extractor:
DECLARE @input_file string = @"\usql-programmability\input_file.tsv";
DECLARE @output_file string = @"\usql-programmability\output_file.tsv";
@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file
USING new USQL_Programmability.FullDescriptionExtractor(Encoding.UTF8);
All input parameters to the outputter, such as column/row delimiters, encoding, and so on, need to be defined in
the constructor of the class. The IOutputter interface should also contain a definition for void Output override.
The attribute [SqlUserDefinedOutputter(AtomicFileProcessing = true) can optionally be set for atomic file
processing. For more information, see the following details.
[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class MyOutputter : IOutputter
{
Output is called for each input row and writes the row to the IUnstructuredWriter output.
The constructor of the class is used to pass parameters to the user-defined outputter.
Close can optionally be overridden to release expensive state or to determine when the last row has been written.
The SqlUserDefinedOutputter attribute indicates that the type should be registered as a user-defined outputter. This class cannot be inherited.
SqlUserDefinedOutputter is an optional attribute for a user-defined outputter definition. It's used to define the AtomicFileProcessing property.
bool AtomicFileProcessing
true = Indicates that this outputter requires atomic output files (JSON, XML, ...)
false = Indicates that this outputter can deal with split / distributed files (CSV, SEQ, ...)
The main programmability objects are row and output. The row object is used to enumerate output data as IRow
interface. Output is used to set output data to the target file.
The output data is accessed through the IRow interface. Output data is passed a row at a time.
The individual values are enumerated by calling the Get method of the IRow interface:
row.Get<string>("column_name")
This approach enables you to build a flexible outputter for any metadata schema.
The output data is written to file by using System.IO.StreamWriter . The stream parameter is set to
output.BaseStream as part of IUnstructuredWriter output .
Note that it's important to flush the data buffer to the file after each row iteration. In addition, the StreamWriter
object must be used with the Disposable attribute enabled (default) and with the using keyword:
using (StreamWriter streamWriter = new StreamWriter(output.BaseStream, this._encoding))
{
…
}
Otherwise, call Flush() method explicitly after each iteration. We show this in the following example.
Set headers and footers for user-defined outputter
To set a header, use single iteration execution flow.
…
if (isHeaderRow)
{
isHeaderRow = false;
}
…
}
}
[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class HTMLOutputter : IOutputter
{
// Local variables initialization
private string row_delimiter;
private char col_delimiter;
private bool isHeaderRow;
private Encoding encoding;
private bool IsTableHeader = true;
private Stream g_writer;
// Parameters definition
public HTMLOutputter(bool isHeader = false, Encoding encoding = null)
{
this.isHeaderRow = isHeader;
this.encoding = ((encoding == null) ? Encoding.UTF8 : encoding);
}
// The Close method is used to write the footer to the file. It's executed only once, after all rows
public override void Close()
{
//Reference to IO.Stream object - g_writer
StreamWriter streamWriter = new StreamWriter(g_writer, this.encoding);
streamWriter.Write("</table>");
streamWriter.Flush();
streamWriter.Close();
}
// Output is called once per row; the HTML row-writing logic is abbreviated in this excerpt
public override void Output(IRow row, IUnstructuredWriter output)
{
StreamWriter streamWriter = new StreamWriter(output.BaseStream, this.encoding);
if (isHeaderRow)
{
isHeaderRow = false;
}
// Reference to the instance of the IO.Stream object for footer generation
g_writer = output.BaseStream;
streamWriter.Flush();
}
}
@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file
USING new USQL_Programmability.FullDescriptionExtractor(Encoding.UTF8);
OUTPUT @rs0
TO @output_file
USING new USQL_Programmability.HTMLOutputter(isHeader: true);
This is an HTML outputter, which creates an HTML file with table data.
Call outputter from U-SQL base script
To call a custom outputter from the base U-SQL script, a new instance of the outputter object has to be created.
To avoid creating an instance of the object in the base script, we can create a function wrapper, as shown here:
OUTPUT @rs0
TO @output_file
USING USQL_Programmability.Factory.HTMLOutputter(isHeader: true);
[SqlUserDefinedProcessor]
public class MyProcessor: IProcessor
{
public override IRow Process(IRow input, IUpdatableRow output)
{
…
}
}
SqlUserDefinedProcessor indicates that the type should be registered as a user-defined processor. This class
cannot be inherited.
The SqlUserDefinedProcessor attribute is optional for UDP definition.
The main programmability objects are input and output. The input object is used to enumerate input columns
and output, and to set output data as a result of the processor activity.
For input column enumeration, we use the input.Get method.
The parameter for the input.Get method is a column that's passed as part of the PRODUCE clause of the PROCESS statement of the U-SQL base script. We need to use the correct data type here.
For output, use the output.Set method.
It's important to note that a custom processor only outputs columns and values that are defined with the output.Set method call.
output.Set<string>("mycolumn", mycolumn);
[SqlUserDefinedProcessor]
public class FullDescriptionProcessor : IProcessor
{
public override IRow Process(IRow input, IUpdatableRow output)
{
string user = input.Get<string>("user");
string des = input.Get<string>("des");
string full_description = user.ToUpper() + "=>" + des;
output.Set<string>("dt", input.Get<string>("dt"));
output.Set<string>("full_description", full_description);
output.Set<Guid>("new_guid", Guid.NewGuid());
output.Set<Guid>("guid", input.Get<Guid>("guid"));
return output.AsReadOnly();
}
}
In this use-case scenario, the processor is generating a new column called "full_description" by combining the existing columns--in this case, "user" in upper case, and "des". It also regenerates a GUID and returns the original and new GUID values.
As you can see from the previous example, you can call C# methods during the output.Set method call.
Following is an example of a base U-SQL script that uses a custom processor:
@rs0 =
EXTRACT
guid Guid,
dt String,
user String,
des String
FROM @input_file USING Extractors.Tsv();
@rs1 =
PROCESS @rs0
PRODUCE dt String,
full_description String,
guid Guid,
new_guid Guid
USING new USQL_Programmability.FullDescriptionProcessor();
SELECT …
FROM …
CROSS APPLY
new MyScript.MyApplier(param1, param2) AS alias(output_param1 string, …);
The CROSS APPLY clause invokes the applier, and the constructor arguments are used to pass parameters to it.
For more information about using appliers in a SELECT expression, see U-SQL SELECT Selecting from CROSS APPLY and OUTER APPLY.
The user-defined applier base class definition is as follows:
To define a user-defined applier, we need to create the IApplier interface with the [ SqlUserDefinedApplier ]
attribute, which is optional for a user-defined applier definition.
[SqlUserDefinedApplier]
public class ParserApplier : IApplier
{
public ParserApplier()
{
…
}
Apply is called for each row of the outer table. It returns the IUpdatableRow output rowset.
The Constructor class is used to pass parameters to the user-defined applier.
SqlUserDefinedApplier indicates that the type should be registered as a user-defined applier. This class cannot
be inherited.
SqlUserDefinedApplier is optional for a user-defined applier definition.
The main programmability objects are as follows:
Input rowsets are passed as IRow input. The output rows are generated as IUpdatableRow output interface.
Individual column names can be determined by calling the IRow Schema method.
To get the actual data values from the incoming IRow , we use the Get() method of IRow interface.
mycolumn = row.Get<int>("mycolumn")
row.Get<int>(row.Schema[0].Name)
output.Set<int>("mycolumn", mycolumn)
It is important to understand that custom appliers only output columns and values that are defined with the output.Set method call.
[SqlUserDefinedApplier]
public class ParserApplier : IApplier
{
private string parsingPart;
@rs0 =
EXTRACT
stocknumber int,
vin String,
properties String
FROM @input_file USING Extractors.Tsv();
@rs1 =
SELECT
r.stocknumber,
r.vin,
properties.make,
properties.model,
properties.year,
properties.type,
properties.millage
FROM @rs0 AS r
CROSS APPLY
new USQL_Programmability.ParserApplier ("all") AS properties(make string, model string, year string, type
string, millage int);
In this use-case scenario, the user-defined applier acts as a comma-delimited value parser for the car fleet properties.
The input file is a typical tab-delimited TSV file with a properties column that contains car properties such as make and model.
Those properties must be parsed to the table columns. The applier that's provided also enables you to generate a
dynamic number of properties in the result rowset, based on the parameter that's passed. You can generate either
all properties or a specific set of properties only.
…USQL_Programmability.ParserApplier ("all")
…USQL_Programmability.ParserApplier ("make")
…USQL_Programmability.ParserApplier ("make&model")
Combine_Expression :=
'COMBINE' Combine_Input
'WITH' Combine_Input
Join_On_Clause
Produce_Clause
[Readonly_Clause]
[Required_Clause]
USING_Clause.
The custom implementation of an ICombiner interface should contain the definition for an IEnumerable<IRow>
Combine override.
[SqlUserDefinedCombiner]
public class MyCombiner : ICombiner
{
The SqlUserDefinedCombiner attribute indicates that the type should be registered as a user-defined combiner.
This class cannot be inherited.
SqlUserDefinedCombiner is used to define the Combiner mode property. It is an optional attribute for a user-
defined combiner definition.
CombinerMode Mode
CombinerMode enum can take the following values:
Full (0) Every output row potentially depends on all the input rows from left and right with the same key
value.
Left (1) Every output row depends on a single input row from the left (and potentially all rows from the right
with the same key value).
Right (2) Every output row depends on a single input row from the right (and potentially all rows from the
left with the same key value).
Inner (3) Every output row depends on a single input row from left and right with the same value.
Example: [ SqlUserDefinedCombiner(Mode=CombinerMode.Left) ]
The main programmability objects are:
Input rowsets are passed as left and right IRowset type of interface. Both rowsets must be enumerated for
processing. You can only enumerate each interface once, so we have to enumerate and cache it if necessary.
For caching purposes, we can create a List<T> type of memory structure as a result of a LINQ query execution,
specifically List< IRow >. The anonymous data type can be used during enumeration as well.
See Introduction to LINQ Queries (C#) for more information about LINQ queries, and IEnumerable<T> Interface
for more information about IEnumerable<T> interface.
To get the actual data values from the incoming IRowset , we use the Get() method of IRow interface.
mycolumn = row.Get<int>("mycolumn")
Individual column names can be determined by calling the IRow Schema method.
row.Get<int>(row.Schema[0].Name)
var myRowset =
(from row in left.Rows
select new
{
Mycolumn = row.Get<int>("mycolumn"),
}).ToList();
After enumerating both rowsets, we are going to loop through all rows. For each row in the left rowset, we are
going to find all rows that satisfy the condition of our combiner.
The output values must be set with IUpdatableRow output.
output.Set<int>("mycolumn", mycolumn)
var resellerSales =
(from row in right.Rows
select new
{
ProductKey = row.Get<int>("ProductKey"),
OrderDateKey = row.Get<int>("OrderDateKey"),
SalesAmount = row.Get<decimal>("SalesAmount"),
TaxAmt = row.Get<decimal>("TaxAmt")
}).ToList();
if (
row_i.OrderDateKey > 0
&& row_i.OrderDateKey < row_r.OrderDateKey
&& row_i.OrderDateKey == 20010701
&& (row_r.SalesAmount + row_r.TaxAmt) > 20000)
{
output.Set<int>("OrderDateKey", row_i.OrderDateKey);
output.Set<int>("ProductKey", row_i.ProductKey);
output.Set<decimal>("Internet_Sales_Amount", row_i.SalesAmount + row_i.TaxAmt);
output.Set<decimal>("Reseller_Sales_Amount", row_r.SalesAmount + row_r.TaxAmt);
}
}
}
yield return output.AsReadOnly();
}
}
In this use-case scenario, we are building an analytics report for the retailer. The goal is to find all products that cost
more than $20,000 and that sell through the website faster than through the regular retailer within a certain time
frame.
Here is the base U -SQL script. You can compare the logic between a regular JOIN and a combiner:
@fact_internet_sales =
EXTRACT
ProductKey int ,
OrderDateKey int ,
DueDateKey int ,
ShipDateKey int ,
CustomerKey int ,
PromotionKey int ,
CurrencyKey int ,
SalesTerritoryKey int ,
SalesOrderNumber String ,
SalesOrderLineNumber int ,
RevisionNumber int ,
OrderQuantity int ,
UnitPrice decimal ,
ExtendedAmount decimal,
UnitPriceDiscountPct float ,
DiscountAmount float ,
ProductStandardCost decimal ,
TotalProductCost decimal ,
SalesAmount decimal ,
TaxAmt decimal ,
Freight decimal ,
CarrierTrackingNumber String,
CustomerPONumber String
FROM @input_file_internet_sales
USING Extractors.Text(delimiter:'|', encoding: Encoding.Unicode);
@fact_reseller_sales =
EXTRACT
ProductKey int ,
OrderDateKey int ,
DueDateKey int ,
ShipDateKey int ,
ResellerKey int ,
EmployeeKey int ,
PromotionKey int ,
CurrencyKey int ,
SalesTerritoryKey int ,
SalesOrderNumber String ,
SalesOrderLineNumber int ,
RevisionNumber int ,
OrderQuantity int ,
UnitPrice decimal ,
ExtendedAmount decimal,
UnitPriceDiscountPct float ,
DiscountAmount float ,
ProductStandardCost decimal ,
TotalProductCost decimal ,
SalesAmount decimal ,
TaxAmt decimal ,
Freight decimal ,
CarrierTrackingNumber String,
CustomerPONumber String
FROM @input_file_reseller_sales
USING Extractors.Text(delimiter:'|', encoding: Encoding.Unicode);
@rs1 =
SELECT
fis.OrderDateKey,
fis.ProductKey,
fis.SalesAmount+fis.TaxAmt AS Internet_Sales_Amount,
frs.SalesAmount+frs.TaxAmt AS Reseller_Sales_Amount
FROM @fact_internet_sales AS fis
INNER JOIN @fact_reseller_sales AS frs
ON fis.ProductKey == frs.ProductKey
WHERE
fis.OrderDateKey < frs.OrderDateKey
AND fis.OrderDateKey == 20010701
AND frs.SalesAmount+frs.TaxAmt > 20000;
@rs2 =
COMBINE @fact_internet_sales AS fis
WITH @fact_reseller_sales AS frs
ON fis.ProductKey == frs.ProductKey
PRODUCE OrderDateKey int,
ProductKey int,
Internet_Sales_Amount decimal,
Reseller_Sales_Amount decimal
USING new USQL_Programmability.CombineSales();
The general form for invoking a custom combiner is: USING new MyNameSpace.MyCombiner();
[SqlUserDefinedReducer]
public class EmptyUserReducer : IReducer
{
The SqlUserDefinedReducer attribute indicates that the type should be registered as a user-defined reducer. This
class cannot be inherited. SqlUserDefinedReducer is an optional attribute for a user-defined reducer definition.
It's used to define IsRecursive property.
bool IsRecursive
true = Indicates whether this Reducer is associative and commutative
The main programmability objects are input and output. The input object is used to enumerate input rows.
Output is used to set output rows as a result of reducing activity.
For input rows enumeration, we use the Row.Get method.
foreach (IRow row in input.Rows)
{
row.Get<string>("mycolumn");
}
The parameter for the Row.Get method is a column that's passed as part of the PRODUCE clause of the REDUCE statement of the U-SQL base script. We need to use the correct data type here as well.
For output, use the output.Set method.
It is important to understand that a custom reducer only outputs values that are defined with the output.Set method call.
output.Set<string>("mycolumn", guid);
[SqlUserDefinedReducer]
public class EmptyUserReducer : IReducer
{
if (user.Length > 0)
{
output.Set<string>("guid", guid);
output.Set<DateTime>("dt", dt);
output.Set<string>("user", user);
output.Set<string>("des", des);
In this use-case scenario, the reducer is skipping rows with an empty user name. For each row in the rowset, it reads each required column, then evaluates the length of the user name. It outputs the actual row only if the user name value length is more than 0.
Following is the base U-SQL script that uses the custom reducer:
DECLARE @input_file string = @"\usql-programmability\input_file_reducer.tsv";
DECLARE @output_file string = @"\usql-programmability\output_file.tsv";
@rs0 =
EXTRACT
guid string,
dt DateTime,
user String,
des String
FROM @input_file
USING Extractors.Tsv();
@rs1 =
REDUCE @rs0 PRESORT guid
ON guid
PRODUCE guid string, dt DateTime, user String, des String
USING new USQL_Programmability.EmptyUserReducer();
@rs2 =
SELECT guid AS start_id,
dt AS start_time,
DateTime.Now.ToString("M/d/yyyy") AS Nowdate,
USQL_Programmability.CustomFunctions.GetFiscalPeriodWithCustomType(dt).ToString() AS
start_fiscalperiod,
user,
des
FROM @rs1;
OUTPUT @rs2
TO @output_file
USING Outputters.Text();
Install Data Lake Tools for Visual Studio
11/7/2018 • 2 minutes to read • Edit Online
Learn how to use Visual Studio to create Azure Data Lake Analytics accounts, define jobs in U-SQL, and submit jobs to the Data Lake Analytics service. For more information about Data Lake Analytics, see Azure Data Lake Analytics overview.
Prerequisites
Visual Studio: All editions except Express are supported.
Visual Studio 2017
Visual Studio 2015
Visual Studio 2013
Microsoft Azure SDK for .NET version 2.7.1 or later. Install it by using the Web platform installer.
A Data Lake Analytics account. To create an account, see Get Started with Azure Data Lake Analytics using
Azure portal.
Next Steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio
Run U-SQL scripts on your local machine
8/27/2018 • 6 minutes to read • Edit Online
When you develop U-SQL scripts, you can save time and expense by running the scripts locally. Azure Data Lake Tools for Visual Studio supports running U-SQL scripts on your local machine.
A local run differs from a run against the Azure service in these components:
Storage: Local data root folder (local) vs. the default Azure Data Lake Store account (Azure).
Compute: U-SQL local run engine (local) vs. the Azure Data Lake Analytics service (Azure).
Run environment: Working directory on the local machine (local) vs. the Azure Data Lake Analytics cluster (Azure).
The sections that follow provide more information about local run components.
Local data root folders
A local data root folder is a local store for the local compute account. Any folder in the local file system on your local machine can be a local data root folder. It's the same as the default Azure Data Lake Store account of a Data Lake Analytics account. Switching to a different data root folder is just like switching to a different default store account.
The data root folder is used as follows:
Store metadata. Examples are databases, tables, table-valued functions, and assemblies.
Look up the input and output paths that are defined as relative paths in U-SQL scripts. By using relative paths, it's easier to deploy your U-SQL scripts to Azure.
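For example, with a hypothetical local data root of C:\LocalRunDataRoot, the relative paths in the following sketch resolve under that folder during a local run, and under the default Data Lake Store account when the script is submitted to Azure (the file names and columns are illustrative):

// Resolves to C:\LocalRunDataRoot\input\events.tsv locally,
// and to /input/events.tsv in the default store account in Azure.
@events =
    EXTRACT EventName string,
            EventCount int
    FROM "/input/events.tsv"
    USING Extractors.Tsv();

OUTPUT @events
TO "/output/events-copy.csv"
USING Outputters.Csv();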
U-SQL local run engines
A U-SQL local run engine is a local compute account for U-SQL jobs. Users can run U-SQL jobs locally through Azure Data Lake Tools for Visual Studio. Local runs are also supported through the Azure Data Lake U-SQL SDK command-line and programming interfaces. Learn more about the Azure Data Lake U-SQL SDK.
Working directories
When you run a U-SQL script, a working directory folder is needed to cache compilation results, run logs, and perform other functions. In Azure Data Lake Tools for Visual Studio, the working directory is the U-SQL project's working directory. It's located under <U-SQL project root path>/bin/debug. The working directory is cleaned every time a new run is triggered.
A U-SQL project is required for a local run. The U-SQL project's working directory is used as the U-SQL local run working directory. Compilation results, run logs, and other job run-related files are generated and stored under the working directory folder during the local run. Every time you rerun the script, all the files in the working directory are cleaned and regenerated.
The Local-machine and Local-project accounts differ along the following angles:
Local access: A Local-machine account can be accessed by all projects; only the corresponding project can access its Local-project account.
Local data root folder: Local-machine uses a permanent local folder, configured through Tools > Data Lake > Options and Settings. Local-project uses a temporary folder created for each local run under the U-SQL project working directory; the folder gets cleaned when a rebuild or rerun happens.
Input data for a U-SQL script: Local-machine resolves the relative path under the permanent local data root folder. Local-project uses the data set through U-SQL project property > Test Data Source; all files and subfolders are copied to the temporary data root folder before a local run.
Output data for a U-SQL script: Local-machine writes to the relative path under the permanent local data root folder. Local-project writes to the temporary data root folder; the results are cleaned when a rebuild or rerun happens.
Referenced database deployment: Referenced databases aren't deployed automatically when running against a Local-machine account; it's the same as submitting to an Azure Data Lake Analytics account. Referenced databases are deployed to the Local-project account automatically before a local run; all database environments are cleaned and redeployed when a rebuild or rerun happens.
Next steps
How to set up a CI/CD pipeline for Azure Data Lake Analytics.
How to test your Azure Data Lake Analytics code.
Debug Azure Data Lake Analytics code locally
8/27/2018 • 2 minutes to read • Edit Online
You can use Azure Data Lake Tools for Visual Studio to run and debug Azure Data Lake Analytics code on your
local workstation, just as you can in the Azure Data Lake Analytics service.
Learn how to run U-SQL scripts on your local machine.
NOTE
The following procedure works only in Visual Studio 2015. In older Visual Studio versions, you might need to manually add
the PDB files.
Next steps
For an example of a more complex query, see Analyze website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data Lake Analytics jobs.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Use a U-SQL database project to develop a U-SQL
database for Azure Data Lake
4/9/2019 • 4 minutes to read • Edit Online
U-SQL database provides structured views over unstructured data and managed structured data in tables. It also provides a general metadata catalog system for organizing your structured data and custom code. The database is the concept that groups these related objects together.
Learn more about U-SQL database and Data Definition Language (DDL).
The U-SQL database project is a project type in Visual Studio that helps developers develop, manage, and deploy their U-SQL databases quickly and easily.
2. In the assembly design view, choose the referenced assembly from Create assembly from reference
drop-down menu.
3. Add Managed Dependencies and Additional Files if there are any. When you add additional files, the
tool uses the relative path to make sure it can find the assemblies both on your local machine and on the
build machine later.
@_DeployTempDirectory is a predefined variable that points the tool to the build output folder. Under the build
output folder, every assembly has a subfolder named with the assembly name. All DLLs and additional files are in
that subfolder.
2. Configure a database reference from a U-SQL database project in the current solution or in a U-SQL database package file.
3. Provide the name for the database.
Next steps
How to set up a CI/CD pipeline for Azure Data Lake Analytics
How to test your Azure Data Lake Analytics code
Run U-SQL scripts on your local machine
Use Job Browser and Job View for Azure Data Lake
Analytics
2/6/2019 • 10 minutes to read • Edit Online
The Azure Data Lake Analytics service archives submitted jobs in a query store. In this article, you learn how to
use Job Browser and Job View in Azure Data Lake Tools for Visual Studio to find the historical job information.
By default, the Data Lake Analytics service archives jobs for 30 days. The expiration period can be configured from the Azure portal by setting a customized expiration policy. You will not be able to access the job information after it expires.
Prerequisites
See Data Lake Tools for Visual Studio prerequisites.
Job View
Job View shows the detailed information of a job. To open a job, you can double-click a job in the Job Browser, or
open it from the Data Lake menu by clicking Job View. You should see a dialog populated with the job URL.
Job Result: Succeeded or failed. The job may fail in any phase.
Total Duration: Wall clock time (duration) between submission time and end time.
Total Compute Time: The sum of every vertex execution time; you can consider it as the time the job would take if it were executed in only one vertex. Refer to Total Vertices for more information about vertices.
Submit/Start/End Time: The time when the Data Lake Analytics service receives job
submission/starts to run the job/ends the job successfully or not.
Compilation/Queued/Running: Wall clock time spent during the Preparing/Queued/Running
phase.
Account: The Data Lake Analytics account used for running the job.
Author: The user who submitted the job; it can be a real person's account or a system account.
Priority: The priority of the job. The lower the number, the higher the priority. It only affects the sequence of the jobs in the queue. Setting a higher priority does not preempt running jobs.
Parallelism: The requested maximum number of concurrent Azure Data Lake Analytics Units (ADLAUs), also known as vertices. Currently, one vertex is equal to one VM with two virtual cores and 6 GB of RAM, though this could be upgraded in future Data Lake Analytics updates.
Bytes Left: Bytes that need to be processed until the job completes.
Bytes read/written: Bytes that have been read/written since the job started running.
Total vertices: The job is broken up into many pieces of work; each piece of work is called a vertex. This value describes how many pieces of work the job consists of. You can consider a vertex as a basic process unit, also known as an Azure Data Lake Analytics Unit (ADLAU), and vertices can run in parallel.
Completed/Running/Failed: The count of completed/running/failed vertices. Vertices can fail due to both user code and system failures, but the system retries failed vertices automatically a few times. If a vertex still fails after retrying, the whole job fails.
Job Graph
A U-SQL script represents the logic of transforming input data to output data. The script is compiled and optimized into a physical execution plan during the Preparing phase. Job Graph shows the physical execution plan. The following diagram illustrates the process:
A job is broken up into many pieces of work. Each piece of work is called a vertex. The vertices are grouped into super vertices (also known as stages), and visualized as the Job Graph. The green stage placards in the job graph show the stages.
Every vertex in a stage is doing the same kind of work on different pieces of the same data. For example, if you have a file with 1 TB of data, and there are hundreds of vertices reading from it, each of them is reading a chunk. Those vertices are grouped in the same stage and do the same work on different pieces of the same input file.
Stage information
In a particular stage, some numbers are shown in the placard.
SV1 Extract: The name of a stage, composed of a number and the operation method.
84 vertices: The total count of vertices in this stage. The figure indicates how many pieces of work this stage is divided into.
12.90 s/vertex: The average vertex execution time for this stage. This figure is calculated as SUM(every vertex execution time) / (total vertex count), which means that if you could run all the vertices in parallel, the whole stage would complete in 12.90 s. It also means that if all the work in this stage were done serially, the cost would be #vertices * AVG time.
850,895 rows written: Total row count written in this stage.
R/W: Amount of data read/written in this stage, in bytes.
Colors: Colors are used in the stage to indicate different vertex states.
Green indicates the vertex succeeded.
Orange indicates the vertex was retried. The retried vertex failed but was retried automatically and successfully by the system, and the overall stage completed successfully. If the vertex was retried but still failed, the color turns red and the whole job fails.
Red indicates failure, which means a certain vertex was retried a few times by the system but still failed. This scenario causes the whole job to fail.
Blue means a certain vertex is running.
White indicates the vertex is waiting. The vertex may be waiting to be scheduled once an ADLAU becomes available, or it may be waiting for input because its input data might not be ready.
You can find more details for a stage by hovering your mouse cursor over one of its states:
Vertices: Describes the vertex details, for example, how many vertices there are in total, how many
have completed, and how many have failed or are still running/waiting.
Data read cross/intra pod: Files and data are stored in multiple pods in the distributed file system. This
value describes how much data has been read within the same pod or across pods.
Total compute time: The sum of every vertex execution time in the stage. You can consider it the
time it would take if all the work in the stage were executed in only one vertex.
Data and rows written/read: Indicates how much data or how many rows have been read/written, or need to be
read.
Vertex read failures: Describes how many vertices failed while reading data.
Vertex duplicate discards: If a vertex runs too slowly, the system may schedule multiple vertices to run
the same piece of work. Redundant vertices are discarded once one of the vertices completes
successfully. Vertex duplicate discards records the number of vertices that are discarded as
duplicates in the stage.
Vertex revocations: The vertex succeeded but was rerun later for some reason. For example,
if a downstream vertex loses intermediate input data, it asks the upstream vertex to rerun.
Vertex schedule executions: The total time that the vertices have been scheduled.
Min/Average/Max Vertex data read: The minimum/average/maximum amount of data read by each vertex.
Duration: The wall-clock time a stage takes. You need to load the profile to see this value.
Job Playback
Data Lake Analytics runs jobs and archives the vertex run information of the jobs, such as
when the vertices started, stopped, or failed, and how they were retried. All of the information is
automatically logged in the query store and stored in the Job Profile. You can download the Job
Profile through "Load Profile" in Job View, and you can view the Job Playback after downloading the
Job Profile.
Job Playback is a condensed visualization of what happened in the cluster. It helps you watch job
execution progress and visually detect performance anomalies and bottlenecks in a very short
time (usually less than 30 seconds).
Job Heat Map Display
Job Heat Map can be selected through the Display dropdown in Job Graph.
It shows the I/O, time, and throughput heat maps of a job, through which you can find where the job
spends most of its time, or whether your job is an I/O-bound job, and so on.
Progress: The job execution progress; see the information in stage information.
Data read/written: The heat map of total data read/written in each stage.
Compute time: The heat map of SUM(every vertex execution time). You can consider this as how
long it would take if all the work in the stage were executed by only one vertex.
Average execution time per node: The heat map of SUM(every vertex execution time) / (vertex
count). If all the vertices could run in parallel, the whole stage would be done in this time frame.
Input/Output throughput: The heat map of input/output throughput of each stage. You can use it to
confirm whether your job is an I/O-bound job.
Metadata Operations
You can perform metadata operations in your U-SQL script, such as creating a database or dropping a table.
These operations are shown in Metadata Operation after compilation. You may find assertions, create
entities, and drop entities here.
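For reference, metadata operations are ordinary U-SQL statements. The following is a minimal sketch; the database and table names are illustrative, not from the sample job:

// Metadata operations that would appear under Metadata Operation after compilation.
// SampleMetadataDB and the Events table are illustrative names.
CREATE DATABASE IF NOT EXISTS SampleMetadataDB;

CREATE TABLE IF NOT EXISTS SampleMetadataDB.dbo.Events
(
    EventId int,
    EventName string,
    INDEX idx CLUSTERED (EventId ASC) DISTRIBUTED BY HASH (EventId)
);

DROP TABLE IF EXISTS SampleMetadataDB.dbo.Events;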
State History
The State History is also visualized in Job Summary, but you can get more details here. You can find
detailed information such as when the job was prepared, queued, started running, and ended. You can also find
how many times the job was compiled (CcsAttempts: 1), when the job was actually dispatched to the cluster
(Detail: Dispatching job to cluster), and so on.
Diagnostics
The tool diagnoses job execution automatically. You will receive alerts when there are errors or
performance issues in your jobs. Note that you need to download the profile to get full information here.
Warnings: An alert shows up here with compiler warnings. You can click the "x issue(s)" link for more
details once the alert appears.
Vertex run too long: If any vertex runs out of time (say, 5 hours), issues are reported here.
Resource usage: If you allocated more parallelism than needed, or not enough, issues are reported here.
You can also click Resource usage to see more details and perform what-if scenarios to find a better
resource allocation (for more details, see this guide).
Memory check: If any vertex uses more than 5 GB of memory, issues are reported here. Job execution
may get killed by the system if it uses more memory than the system limit.
Job Detail
Job Detail shows the detailed information of the job, including Script, Resources and Vertex Execution View.
Script
The U -SQL script of the job is stored in the query store. You can view the original U -SQL script and re-
submit it if needed.
Resources
You can find the job compilation outputs stored in the query store through Resources. For instance, you can
find "algebra.xml", which is used to show the Job Graph, the assemblies you registered, and so on.
Vertex execution view
It shows vertex execution details. The Job Profile archives every vertex execution log, such as total data
read/written, runtime, state, and so on. Through this view, you can get more details on how a job ran. For more
information, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.
Next Steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To use vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio
Debug user-defined C# code for failed U-SQL jobs
4/11/2019 • 3 minutes to read • Edit Online
U-SQL provides an extensibility model using C#. In U-SQL scripts, it is easy to call C# functions and perform
analytic functions that the SQL-like declarative language does not support. To learn more about U-SQL extensibility, see the
U-SQL programmability guide.
In practice, any code may need debugging, but it is hard to debug a distributed job with custom code in the cloud
with only limited log files. Azure Data Lake Tools for Visual Studio provides a feature called Failed Vertex Debug,
which helps you more easily debug the failures that occur in your custom code. When a U-SQL job fails, the service
keeps the failure state, and the tool helps you download the cloud failure environment to your local machine for
debugging. The local download captures the entire cloud environment, including any input data and user code.
The following video demonstrates Failed Vertex Debug in Azure Data Lake Tools for Visual Studio.
IMPORTANT
Visual Studio requires the following two updates for using this feature: Microsoft Visual C++ 2015 Redistributable Update 3
and the Universal C Runtime for Windows.
In the newly launched Visual Studio instance, you may or may not find the user-defined C# source code:
1. I can find my source code in the solution
2. I cannot find my source code in the solution
Source code is included in debugging solution
There are two cases in which the C# source code is captured:
1. The user code is defined in a code-behind file (typically named Script.usql.cs in a U-SQL project).
2. The user code is defined in a C# class library project for a U-SQL application, and registered as an assembly with
debug info.
If the source code is imported to the solution, you can use the Visual Studio debugging tools (watch, variables, etc.)
to troubleshoot the problem:
1. Press F5 to start debugging. The code runs until it is stopped by an exception.
2. Open the source code file and set breakpoints, then press F5 to debug the code step by step.
After these settings, start debugging with F5 and breakpoints. You can also use the Visual Studio debugging tools
(watch, variables, etc.) to troubleshoot the problem.
NOTE
Rebuild the assembly source code project each time after you modify the code to generate updated .pdb files.
2. For jobs with assemblies, right-click the assembly source code project in debugging solution and register
the updated .dll assemblies into your Azure Data Lake catalog.
3. Resubmit the U -SQL job.
Next steps
U -SQL programmability guide
Develop U -SQL User-defined operators for Azure Data Lake Analytics jobs
Test and debug U -SQL jobs by using local run and the Azure Data Lake U -SQL SDK
How to troubleshoot an abnormal recurring job
Troubleshoot an abnormal recurring job
11/7/2018 • 2 minutes to read • Edit Online
This article shows how to use Azure Data Lake Tools for Visual Studio to troubleshoot problems with recurring
jobs. Learn more about pipeline and recurring jobs from the Azure Data Lake and Azure HDInsight blog.
Recurring jobs usually share the same query logic and similar input data. For example, imagine that you have a
recurring job running every Monday morning at 8 A.M. to count last week's weekly active users. The scripts for
these jobs share one script template that contains the query logic. The inputs for these jobs are the usage data for
last week. Sharing the same query logic and similar input usually means that performance of these jobs is similar
and stable. If one of your recurring jobs suddenly performs abnormally, fails, or slows down a lot, you might want
to:
See the statistics reports for the previous runs of the recurring job to see what happened.
Compare the abnormal job with a normal one to figure out what has been changed.
Related Job View in Azure Data Lake Tools for Visual Studio helps you accelerate the troubleshooting process
in both cases.
Case 2: You have the pipeline for the recurring job, but not the URL
In Visual Studio, you can open Pipeline Browser through Server Explorer > your Azure Data Lake Analytics
account > Pipelines. (If you can't find this node in Server Explorer, download the latest plug-in.)
In Pipeline Browser, all pipelines for the Data Lake Analytics account are listed at left. You can expand the pipelines
to find all recurring jobs, and then select the one that has problems. Related Job View opens at right.
Step 2: Analyze a statistics report
A summary and a statistics report are shown at top of Related Job View. There, you can find the potential root
cause of the problem.
1. In the report, the X-axis shows the job submission time. Use it to find the abnormal job.
2. Use the process in the following diagram to check statistics and get insights about the problem and the possible
solutions.
Next steps
Resolve data-skew problems
Debug user-defined C# code for failed U -SQL jobs
Use the Vertex Execution View in Data Lake Tools for
Visual Studio
8/27/2018 • 2 minutes to read • Edit Online
Learn how to use the Vertex Execution View to examine Data Lake Analytics jobs.
The top center pane shows the running status of all the vertices.
Next steps
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics
To see a more complex query, see Analyze Website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data lake Analytics jobs
Export a U-SQL database
4/9/2019 • 3 minutes to read • Edit Online
In this article, learn how to use Azure Data Lake Tools for Visual Studio to export a U -SQL database as a single U -
SQL script and downloaded resources. You can import the exported database to a local account in the same
process.
Customers usually maintain multiple environments for development, test, and production. These environments are
hosted on both a local account, on a developer's local computer, and in an Azure Data Lake Analytics account in
Azure.
When you develop and tune U -SQL queries in development and test environments, developers often need to re-
create their work in a production database. The Database Export Wizard helps accelerate this process. By using the
wizard, developers can clone the existing database environment and sample data to other Data Lake Analytics
accounts.
Export steps
Step 1: Export the database in Server Explorer
All Data Lake Analytics accounts that you have permissions for are listed in Server Explorer. To export the
database:
1. In Server Explorer, expand the account that contains the database that you want to export.
2. Right-click the database, and then select Export.
If the Export menu option isn't available, you need to update the tool to the latest release.
Step 2: Configure the objects that you want to export
If you need only a small part of a large database, you can configure a subset of objects that you want to export in
the export wizard.
The export action is completed by running a U -SQL job. Therefore, exporting from an Azure account incurs some
cost.
Step 3: Check the objects list and other configurations
In this step, you can verify the selected objects in the Export object list box. If there are any errors, select
Previous to go back and correctly configure the objects that you want to export.
You can also configure other settings for the export target. Configuration descriptions are listed in the following
table:
CONFIGURATION DESCRIPTION
Destination Name This name indicates where you want to save the exported
database resources. Examples are assemblies, additional files,
and sample data. A folder with this name is created under
your local data root folder.
Project Directory This path defines where you want to save the exported U-SQL
script. All database object definitions are saved at this location.
Schema Only If you select this option, only database definitions and
resources (like assemblies and additional files) are exported.
Schema and Data If you select this option, database definitions, resources, and
data are exported. The top N rows of tables are exported.
Import to Local Database Automatically If you select this option, the exported database is
automatically imported to your local database when exporting
is finished.
Step 4: Check the export results
When exporting is finished, you can view the exported results in the log window in the wizard. The following
example shows how to find exported U -SQL script and database resources, including assemblies, additional files,
and sample data:
Import the exported database to a local account
The most convenient way to import the exported database is to select the Import to Local Database
Automatically check box during the exporting process in Step 3. If you didn't check this box, first, find the
exported U -SQL script in the export log. Then, run the U -SQL script locally to import the database to your local
account.
Known limitations
Currently, if you select the Schema and Data option in Step 3, the tool runs a U -SQL job to export the data stored
in tables. Because of this, the data exporting process might be slow and you might incur costs.
Next steps
Learn about U -SQL databases
Test and debug U -SQL jobs by using local run and the Azure Data Lake U -SQL SDK
Analyze Website logs using Azure Data Lake
Analytics
1/18/2019 • 4 minutes to read • Edit Online
Learn how to analyze website logs using Data Lake Analytics, especially on finding out which referrers ran into
errors when they tried to visit the website.
Prerequisites
Visual Studio 2015 or Visual Studio 2013.
Data Lake Tools for Visual Studio.
Once Data Lake Tools for Visual Studio is installed, you will see a Data Lake item in the Tools menu in
Visual Studio:
Basic knowledge of Data Lake Analytics and the Data Lake Tools for Visual Studio. To get started,
see:
Develop U -SQL script using Data Lake tools for Visual Studio.
A Data Lake Analytics account. See Create an Azure Data Lake Analytics account.
Install the sample data. In the Azure portal, open your Data Lake Analytics account and click Sample
Scripts on the left menu, then click Copy Sample Data.
Connect to Azure
Before you can build and test any U -SQL scripts, you must first connect to Azure.
To connect to Data Lake Analytics
1. Open Visual Studio.
2. Click Data Lake > Options and Settings.
3. Click Sign In, or Change User if someone has signed in, and follow the instructions.
4. Click OK to close the Options and Settings dialog.
To browse your Data Lake Analytics accounts
1. From Visual Studio, open Server Explorer by pressing CTRL+ALT+S.
2. From Server Explorer, expand Azure, and then expand Data Lake Analytics. You'll see a list of your
Data Lake Analytics accounts if there are any. You cannot create Data Lake Analytics accounts from Visual Studio.
To create an account, see Get Started with Azure Data Lake Analytics using the Azure portal or Get Started with
Azure Data Lake Analytics using Azure PowerShell.
// Create a database for easy reuse, so you don't need to read from a file every time.
CREATE DATABASE IF NOT EXISTS SampleDBTutorials;
// Create a table-valued function (TVF). The TVF ensures that your jobs fetch data from the weblog file
// with the correct schema.
DROP FUNCTION IF EXISTS SampleDBTutorials.dbo.WeblogsView;
CREATE FUNCTION SampleDBTutorials.dbo.WeblogsView()
RETURNS @result TABLE
(
s_date DateTime,
s_time string,
s_sitename string,
cs_method string,
cs_uristem string,
cs_uriquery string,
s_port int,
cs_username string,
c_ip string,
cs_useragent string,
cs_cookie string,
cs_referer string,
cs_host string,
sc_status int,
sc_substatus int,
sc_win32status int,
sc_bytes int,
cs_bytes int,
s_timetaken int
)
AS
BEGIN
@result = EXTRACT
s_date DateTime,
s_time string,
s_sitename string,
cs_method string,
cs_uristem string,
cs_uriquery string,
s_port int,
cs_username string,
c_ip string,
cs_useragent string,
cs_cookie string,
cs_referer string,
cs_host string,
sc_status int,
sc_substatus int,
sc_win32status int,
sc_bytes int,
cs_bytes int,
s_timetaken int
FROM @"/Samples/Data/WebLog.log"
USING Extractors.Text(delimiter:' ');
RETURN;
END;
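The tutorial's sample script also materializes a SampleDBTutorials.dbo.ReferrersPerDay table that the second script below queries; that part is not shown in this excerpt. The following is a minimal sketch of how such a table could be created from the view, under the assumption that it keeps the s_date, cs_referer, and sc_status columns; the exact column list and index used by the tutorial may differ:

// Sketch only: create a table of referrer counts per day from the weblog view.
DROP TABLE IF EXISTS SampleDBTutorials.dbo.ReferrersPerDay;

CREATE TABLE SampleDBTutorials.dbo.ReferrersPerDay
(
    INDEX idx1
    CLUSTERED (ReportDate ASC)
    DISTRIBUTED BY HASH (ReportDate)
) AS
SELECT s_date AS ReportDate,
       cs_referer,
       sc_status,
       COUNT( * ) AS Total
FROM SampleDBTutorials.dbo.WeblogsView() AS weblog
GROUP BY s_date,
         cs_referer,
         sc_status;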
To understand the U -SQL, see Get started with Data Lake Analytics U -SQL language.
5. Add a new U -SQL script to your project and enter the following:
// Query the referrers that ran into errors
@content =
SELECT *
FROM SampleDBTutorials.dbo.ReferrersPerDay
WHERE sc_status >=400 AND sc_status < 500;
OUTPUT @content
TO @"/Samples/Outputs/UnsuccessfulResponses.log"
USING Outputters.Tsv();
6. Switch back to the first U-SQL script and, next to the Submit button, specify your Analytics account.
7. From Solution Explorer, right-click Script.usql, and then click Build Script. Verify the results in the
Output pane.
8. From Solution Explorer, right-click Script.usql, and then click Submit Script.
9. Verify the Analytics Account is the one where you want to run the job, and then click Submit.
Submission results and job link are available in the Data Lake Tools for Visual Studio Results window when
the submission is completed.
10. Wait until the job completes successfully. If the job failed, it is most likely missing the source file. See
the Prerequisites section of this tutorial. For additional troubleshooting information, see Monitor and
troubleshoot Azure Data Lake Analytics jobs.
When the job is completed, you'll see the following screen:
11. Now repeat steps 7-10 for Script1.usql.
To see the job output
1. From Server Explorer, expand Azure, expand Data Lake Analytics, expand your Data Lake Analytics
account, expand Storage Accounts, right-click the default Data Lake Storage account, and then click
Explorer.
2. Double-click Samples to open the folder, and then double-click Outputs.
3. Double-click UnsuccessfulResponses.log.
4. You can also double-click the output file inside the graph view of the job in order to navigate directly to the
output.
See also
To get started with Data Lake Analytics using different tools, see:
Get started with Data Lake Analytics using Azure Portal
Get started with Data Lake Analytics using Azure PowerShell
Get started with Data Lake Analytics using .NET SDK
Resolve data-skew problems by using Azure Data
Lake Tools for Visual Studio
5/14/2019 • 7 minutes to read • Edit Online
In our scenario, the data is unevenly distributed across all tax examiners, which means that some examiners must
work more than others. In your own job, you frequently experience situations like the tax-examiner example here.
In more technical terms, one vertex gets much more data than its peers, a situation that makes the vertex work
more than the others and that eventually slows down an entire job. What's worse, the job might fail, because
vertices might have, for example, a 5-hour runtime limitation and a 6-GB memory limitation.
NOTE
Statistics information is not updated automatically. If you update the data in a table without re-creating the statistics, the
query performance might decline.
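For reference, re-creating statistics in U-SQL looks like the following minimal sketch; the database, table, and column names here are illustrative, not from the sample scenario:

// Re-create statistics so the optimizer has fresh estimates for the skewed column.
// SampleDB.dbo.Customers and the Region column are placeholder names.
CREATE STATISTICS IF NOT EXISTS stat_Customers_Region
ON SampleDB.dbo.Customers (Region)
WITH FULLSCAN;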
SKEWFACTOR (columns) = x
Provides a hint that the given columns have a skew factor x from 0 (no skew) through 1 (very heavy skew).
Code example:
//Add a SKEWFACTOR hint.
@Impressions =
SELECT * FROM
searchDM.SML.PageView(@start, @end) AS PageView
OPTION(SKEWFACTOR(Query)=0.5)
;
OPTION(ROWCOUNT = n)
Identify a small row set before JOIN by providing an estimated integer row count.
Code example:
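The original example isn't shown in this excerpt; the following is a minimal sketch of the hint that reuses the @Impressions rowset from the SKEWFACTOR example above. The @Sellers rowset, the OnlineSellers table-valued function, the SellerId column, and the estimate of 500 rows are illustrative assumptions:

//Add a ROWCOUNT hint: tell the optimizer that @Sellers is a small rowset (~500 rows).
@Sellers =
    SELECT * FROM
    searchDM.SML.OnlineSellers() AS OnlineSellers
    OPTION(ROWCOUNT=500)
    ;

//With the estimate in place, the optimizer can pick a strategy suited to a small JOIN side.
@Results =
    SELECT Impressions.Query, Sellers.SellerId
    FROM @Impressions AS Impressions
         INNER JOIN @Sellers AS Sellers
         ON Impressions.Query == Sellers.Query;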
[SqlUserDefinedReducer(IsRecursive = true)]
Code example:
[SqlUserDefinedReducer(IsRecursive = true)]
public class TopNReducer : IReducer
{
public override IEnumerable<IRow>
Reduce(IRowset input, IUpdatableRow output)
{
//Your reducer code goes here.
}
}
[SqlUserDefinedCombiner(Mode = CombinerMode.Right)]
public class WatsonDedupCombiner : ICombiner
{
public override IEnumerable<IRow>
Combine(IRowset left, IRowset right, IUpdatableRow output)
{
//Your combiner code goes here.
}
}
Monitor jobs in Azure Data Lake Analytics using the
Azure Portal
8/27/2018 • 2 minutes to read • Edit Online
The Job Management tile gives you a glance at the job status. Notice there is a failed job.
3. Click the Job Management tile to see the jobs. The jobs are categorized as Running, Queued, and
Ended. You'll see your failed job in the Ended section. It should be the first one in the list. When you have a
lot of jobs, you can click Filter to help you locate jobs.
4. Click the failed job from the list to open the job details:
Notice the Resubmit button. After you fix the problem, you can resubmit the job.
5. Click the highlighted part from the previous screenshot to open the error details. You'll see something like:
See also
Azure Data Lake Analytics overview
Get started with Azure Data Lake Analytics using Azure PowerShell
Manage Azure Data Lake Analytics using Azure portal
Use Azure Data Lake Tools for Visual Studio Code
3/6/2019 • 14 minutes to read • Edit Online
In this article, learn how you can use Azure Data Lake Tools for Visual Studio Code (VS Code) to create, test, and
run U -SQL scripts. The information is also covered in the following video:
Prerequisites
Azure Data Lake Tools for VS Code supports Windows, Linux, and macOS. U-SQL local run and local debug
work only in Windows.
Visual Studio Code
For MacOS and Linux:
.NET Core SDK 2.0
Mono 5.2.x
@departments =
SELECT * FROM
(VALUES
(31, "Sales"),
(33, "Engineering"),
(34, "Clerical"),
(35, "Marketing")
) AS
D( DepID, DepName );
OUTPUT @departments
TO "/Output/departments.csv"
USING Outputters.Csv();
The script creates a departments.csv file with some data in the /Output folder.
5. Save the file as myUSQL.usql in the opened folder.
To compile a U -SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Compile Script. The compile results appear in the Output window. You can also right-click a
script file, and then select ADL: Compile Script to compile a U -SQL job. The compilation result appears in
the Output pane.
To submit a U -SQL script
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Submit Job. You can also right-click a script file, and then select ADL: Submit Job.
After you submit a U -SQL job, the submission logs appear in the Output window in VS Code. The job view
appears in the right pane. If the submission is successful, the job URL appears too. You can open the job URL in a
web browser to track the real-time job status.
On the job view's SUMMARY tab, you can see the job details. The main functions include resubmitting a
script, duplicating a script, and opening it in the portal. On the job view's DATA tab, you can refer to the input files, output files, and
resource files. Files can be downloaded to the local computer.
NOTE
Azure Data Lake Tools autodetects whether the DLL has any assembly dependencies. The dependencies are displayed in
the JSON file after they're detected.
You can upload your DLL resources (for example, .txt, .png, and .csv) as part of the assembly registration.
Another way to trigger the ADL: Register Assembly (Advanced) command is to right-click the .dll file in File
Explorer.
The following U -SQL code demonstrates how to call an assembly. In the sample, the assembly name is test.
REFERENCE ASSEMBLY [test];
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"Sample/SearchLog.txt"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"Sample/SearchLogtest.txt"
USING Outputters.Tsv();
Use U-SQL local run and local debug for Windows users
U -SQL local run tests your local data and validates your script locally before your code is published to Data Lake
Analytics. You can use the local debug feature to complete the following tasks before your code is submitted to
Data Lake Analytics:
Debug your C# code-behind.
Step through the code.
Validate your script locally.
The local run and local debug feature only works in Windows environments, and is not supported on macOS and
Linux-based operating systems.
For instructions on local run and local debug, see U -SQL local run and local debug with Visual Studio Code.
Connect to Azure
Before you can compile and run U -SQL scripts in Data Lake Analytics, you must connect to your Azure account.
To connect to Azure by using a command
1. Select Ctrl+Shift+P to open the command palette.
2. Enter ADL: Login. The login information appears on the lower right.
3. Select Copy & Open to open the login webpage. Paste the code into the box, and then select Continue.
4. Follow the instructions to sign in from the webpage. When you're connected, your Azure account name
appears on the status bar in the lower-left corner of the VS Code window.
NOTE
Data Lake Tools automatically signs you in the next time if you don't sign out.
If your account has two-factor authentication enabled, we recommend that you use phone authentication rather than a PIN.
You can't sign out from the explorer. To sign out, see To connect to Azure by using a command.
A more convenient way to list the relative path is through the shortcut menu.
To list the storage path through the shortcut menu
Right-click the path string and select List Path.
Another way to preview the file is through the shortcut menu on the file's full path or the file's relative path in the
script editor.
Upload a file or folder
1. Right-click the script editor and select Upload File or Upload Folder.
2. Choose one file or multiple files if you selected Upload File, or choose the whole folder if you selected
Upload Folder. Then select Upload.
3. Choose the storage folder in the list, or select Enter a path or Browse from root path. (We're using Enter a
path as an example.)
4. Select your Data Lake Analytics account.
5. Browse to or enter the storage folder path (for example, /output/).
6. Select Choose Current Folder to specify your upload destination.
Another way to upload files to storage is through the shortcut menu on the file's full path or the file's relative path
in the script editor.
You can monitor the upload status.
Download a file
You can download a file by using the command ADL: Download File or ADL: Download File (Advanced).
To download a file through the ADL: Download File (Advanced) command
1. Right-click the script editor, and then select Download File (Advanced).
2. VS Code displays a JSON file. You can enter file paths and download multiple files at the same time.
Instructions are displayed in the Output window. To proceed to download the file or files, save (Ctrl+S ) the
JSON file.
You can right-click the folder node and then use the Refresh and Upload Blob commands on the shortcut
menu.
You can right-click the file node and then use the Preview/Edit, Download, Delete, Create EXTRACT
Script (available only for CSV, TSV, and TXT files), Copy Relative Path, and Copy Full Path commands
on the shortcut menu.
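For context, the Create EXTRACT Script command mentioned above generates a U-SQL EXTRACT skeleton for the selected file. The following is a hedged sketch of what such a script generally looks like for a CSV file; the column names, types, and paths are illustrative, not necessarily what the tool emits:

// Sketch of an EXTRACT statement for a CSV file; adjust column names and types to your data.
@data =
    EXTRACT CustomerId int,
            CustomerName string,
            Amount double
    FROM "/input/customers.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

OUTPUT @data
TO "/output/customers_preview.csv"
USING Outputters.Csv();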
Additional features
Data Lake Tools for VS Code supports the following features:
IntelliSense autocomplete: Suggestions appear in pop-up windows around items like keywords,
methods, and variables. Different icons represent different types of objects:
Scalar data type
Complex data type
Built-in UDTs
.NET collection and classes
C# expressions
Built-in C# UDFs, UDOs, and UDAGGs
U -SQL functions
U -SQL windowing functions
IntelliSense autocomplete on Data Lake Analytics metadata: Data Lake Tools downloads the Data
Lake Analytics metadata information locally. The IntelliSense feature automatically populates objects from
the Data Lake Analytics metadata. These objects include the database, schema, table, view, table-valued
function, procedures, and C# assemblies.
IntelliSense error marker: Data Lake Tools underlines editing errors for U -SQL and C#.
Syntax highlights: Data Lake Tools uses colors to differentiate items like variables, keywords, data types,
and functions.
NOTE
We recommend that you upgrade to Azure Data Lake Tools for Visual Studio version 2.3.3000.4 or later. The previous
versions are no longer available for download and are now deprecated.
Next steps
Develop U -SQL with Python, R, and C Sharp for Azure Data Lake Analytics in VS Code
U -SQL local run and local debug with Visual Studio Code
Tutorial: Get started with Azure Data Lake Analytics
Tutorial: Develop U -SQL scripts by using Data Lake Tools for Visual Studio
Develop U-SQL with Python, R, and C# for Azure
Data Lake Analytics in Visual Studio Code
3/15/2019 • 3 minutes to read • Edit Online
Learn how to use Visual Studio Code (VSCode) to write Python, R, and C# code behind with U-SQL and submit
jobs to the Azure Data Lake Analytics service. For more information about Azure Data Lake Tools for VSCode, see Use the
Azure Data Lake Tools for Visual Studio Code.
Before writing code-behind custom code, you need to open a folder or a workspace in VSCode.
NOTE
For the best experience with the Python and R language services, install the VSCode Python and R extensions.
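The Python reducer sample that follows operates on a rowset @t that isn't shown in this excerpt. The following is a minimal sketch of such a rowset, with column names inferred from the Python code further below, and it assumes the Python extension assembly (ExtPython) is registered in your account; the sample rows are illustrative:

REFERENCE ASSEMBLY [ExtPython];

// Illustrative tweet data; only the column names (date, time, author, tweet) matter here.
@t =
    SELECT * FROM
        (VALUES
            ("2016-01-01", "08:00", "alice", "@bob Hello world"),
            ("2016-01-01", "09:15", "bob", "Replying to @alice and @carol")
        ) AS
              D( date, time, author, tweet );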
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer("pythonSample.usql.py", pyVersion : "3.5.1");
OUTPUT @m
TO "/tweetmentions.csv"
USING Outputters.Csv();
3. Right-click a script file, and then select ADL: Generate Python Code Behind File.
4. The xxx.usql.py file is generated in your working folder. Write your code in the Python file. The following is a
code sample.
def get_mentions(tweet):
return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )
def usqlml_main(df):
del df['time']
del df['author']
df['mentions'] = df.tweet.apply(get_mentions)
del df['tweet']
return df
5. Right-click the U-SQL file, and then select Compile Script or Submit Job to run the job.
Develop R file
1. Click the New File in your workspace.
2. Write your code in U -SQL file. The following is a code sample.
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/LMPredictionsIris.txt";
DECLARE @PartitionCount int = 10;
@InputData =
EXTRACT SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
SepalLength,
SepalWidth,
PetalLength,
PetalWidth
FROM @InputData;
// Predict Species
@RScriptOutput =
REDUCE @ExtendedData
ON Par
PRODUCE Par,
fit double,
lwr double,
upr double
READONLY Par
USING new Extension.R.Reducer(scriptFile : "RClusterRun.usql.R", rReturnType : "dataframe",
stringsAsFactors : false);
OUTPUT @RScriptOutput
TO @OutputFilePredictions
USING Outputters.Tsv();
3. Right-click the U-SQL file, and then select ADL: Generate R Code Behind File.
4. The xxx.usql.r file is generated in your working folder. Write your code in the R file. The following is a code
sample.
load("my_model_LM_Iris.rda")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval="confidence"))
5. Right-click the U-SQL file, and then select Compile Script or Submit Job to run the job.
Develop C# file
A code-behind file is a C# file associated with a single U-SQL script. You can define a script dedicated to UDO,
UDA, UDT, and UDF in the code-behind file. The UDO, UDA, UDT, and UDF can be used directly in the script
without registering the assembly first. The code-behind file is put in the same folder as its paired U-SQL script
file. If the script is named xxx.usql, the code-behind is named xxx.usql.cs. If you manually delete the code-behind
file, the code-behind feature is disabled for its associated U-SQL script. For more information about writing
custom code for U-SQL scripts, see Writing and Using Custom Code in U-SQL: User-Defined Functions.
1. Click the New File in your workspace.
2. Write your code in U -SQL file. The following is a code sample.
@a =
EXTRACT
Iid int,
Starts DateTime,
Region string,
Query string,
DwellTime int,
Results string,
ClickedUrls string
FROM @"/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
@d =
SELECT DISTINCT Region
FROM @a;
@d1 =
PROCESS @d
PRODUCE
Region string,
Mkt string
USING new USQLApplication_codebehind.MyProcessor();
OUTPUT @d1
TO @"/output/SearchLogtest.txt"
USING Outputters.Tsv();
3. Right-click the U-SQL file, and then select ADL: Generate CS Code Behind File.
4. The xxx.usql.cs file is generated in your working folder. Write your code in the CS file. The following is a code
sample.
namespace USQLApplication_codebehind
{
[SqlUserDefinedProcessor]
5. Right-click the U-SQL file, and then select Compile Script or Submit Job to run the job.
Next steps
Use the Azure Data Lake Tools for Visual Studio Code
U -SQL local run and local debug with Visual Studio Code
Get started with Data Lake Analytics using PowerShell
Get started with Data Lake Analytics using the Azure portal
Use Data Lake Tools for Visual Studio for developing U -SQL applications
Use Data Lake Analytics (U-SQL) catalog
Run U-SQL and debug locally in Visual Studio Code
9/14/2018 • 2 minutes to read • Edit Online
This article describes how to run U -SQL jobs on a local development machine to speed up early coding phases or
to debug code locally in Visual Studio Code. For instructions on Azure Data Lake Tool for Visual Studio Code, see
Use Azure Data Lake Tools for Visual Studio Code.
Only Windows installations of the Azure Data Lake Tools for Visual Studio support the action to run U -SQL
locally and debug U -SQL locally. Installations on macOS and Linux-based operating systems do not support this
feature.
2. Locate the dependency packages from the path shown in the Output pane, and then install BuildTools and
Win10SDK 10240. Here is an example path:
C:\Users\xxx\AppData\Roaming\LocalRunDependency
2.1 To install BuildTools, click visualcppbuildtools_full.exe in the LocalRunDependency folder, then follow
the wizard instructions.
2.2 To install Win10SDK 10240, click sdksetup.exe in the LocalRunDependency/Win10SDK_10.0.10240_2
folder, then follow the wizard instructions.
3. Set up the environment variable. Set the SCOPE_CPP_SDK environment variable to:
C:\Users\XXX\AppData\Roaming\LocalRunDependency\CppSDK_3rdparty
Start the local run service and submit the U-SQL job to a local account
If you are a first-time user, use ADL: Download Local Run Package to download the local run packages if you have
not set up the U-SQL local run environment.
1. Select Ctrl+Shift+P to open the command palette, and then enter ADL: Start Local Run Service.
2. Select Accept to accept the Microsoft Software License Terms for the first time.
3. The cmd console opens. For first-time users, you need to enter 3, and then locate the local folder path for
your data input and output. For other options, you can use the default values.
4. Select Ctrl+Shift+P to open the command palette, enter ADL: Submit Job, and then select Local to
submit the job to your local account.
5. After you submit the job, you can view the submission details. To view the submission details, select
jobUrl in the Output window. You can also view the job submission status from the cmd console. Enter 7
in the cmd console if you want to know more job details.
Start a local debug for the U-SQL job
For the first-time user:
1. Use ADL: Download Local Run Package to download local run packages, if you have not set up U -SQL
local run environment.
2. Install .NET Core SDK 2.0 as suggested in the message box, if not installed.
3. Install C# for Visual Studio Code as suggested in the message box if not installed. Click Install to continue,
and then restart VSCode.
In this document, you learn how to orchestrate and create U-SQL jobs using SQL Server Integration Services
(SSIS).
Prerequisites
Azure Feature Pack for Integration Services provides the Azure Data Lake Analytics task and the Azure Data Lake
Analytics Connection Manager, which help you connect to the Azure Data Lake Analytics service. To use this task, make sure
you install:
Download and install SQL Server Data Tools (SSDT) for Visual Studio
Install Azure Feature Pack for Integration Services (SSIS )
You can get the U-SQL script from different places by using SSIS built-in functions and tasks. The following
scenarios show how you can configure the U-SQL scripts for different use cases.
Learn more about Azure Data Lake Store File System Task.
Configure Foreach Loop Container
1. In Collection page, set Enumerator to Foreach File Enumerator.
2. Set Folder under Enumerator configuration group to the temporary folder that includes the downloaded
U -SQL scripts.
3. Set Files under Enumerator configuration to *.usql so that the loop container only catches the files
ending with .usql .
4. In the Variable Mappings page, add a user-defined variable to get the file name for each U-SQL file. Set the
Index to 0 to get the file name. In this example, define a variable called User::FileName . This variable is
used to dynamically get the U-SQL script file connection and set the U-SQL job name in the Azure Data Lake Analytics
Task.
Configure Azure Data Lake Analytics Task
1. Set SourceType to FileConnection.
2. Set FileConnection to the file connection that points to the file objects returned from Foreach Loop
Container.
To create this file connection:
a. Choose <New Connection...> in FileConnection setting.
b. Set Usage type to Existing file, and set the File to any existing file's file path.
c. In Connection Managers view, right-click the file connection created just now, and choose
Properties.
d. In the Properties window, expand Expressions, and set ConnectionString to the variable defined
in Foreach Loop Container, for example, @[User::FileName] .
3. Set AzureDataLakeAnalyticsConnection to the Azure Data Lake Analytics account that you want to
submit jobs to. Learn more about Azure Data Lake Analytics Connection Manager.
4. Set other job configurations. Learn More.
5. Use Expressions to dynamically set U -SQL job name:
a. In Expressions page, add a new expression key-value pair for JobName.
b. Set the value for JobName to the variable defined in Foreach Loop Container, for example,
@[User::FileName] .
Scenario 3: Use U-SQL files in Azure Blob Storage
You can use U-SQL files in Azure Blob Storage by using the Azure Blob Download Task in the Azure Feature Pack. This
approach enables you to use scripts stored in the cloud.
The steps are similar to Scenario 2: Use U-SQL files in Azure Data Lake Store. Change the Azure Data Lake
Store File System Task to the Azure Blob Download Task. Learn more about the Azure Blob Download Task.
The control flow looks like the following.
Next steps
Run SSIS packages in Azure
Azure Feature Pack for Integration Services (SSIS )
Schedule U -SQL jobs using Azure Data Factory
How to set up a CI/CD pipeline for Azure Data Lake
Analytics
4/12/2019 • 14 minutes to read • Edit Online
In this article, you learn how to set up a continuous integration and deployment (CI/CD ) pipeline for U -SQL jobs
and U -SQL databases.
NOTE
This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will
continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM
compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure
PowerShell.
<!-- check for SDK Build target in current path then in USQLSDKPath-->
<Import Project="UsqlSDKBuild.targets" Condition="Exists('UsqlSDKBuild.targets')" />
<Import Project="$(USQLSDKPath)\UsqlSDKBuild.targets" Condition="!Exists('UsqlSDKBuild.targets') And '$(USQLSDKPath)' != '' And Exists('$(USQLSDKPath)\UsqlSDKBuild.targets')" />
NOTE
The DROP statement may cause accidental deletion. To enable DROP statements, you need to explicitly specify the MSBuild
arguments. AllowDropStatement enables non-data-related DROP operations, like drop assembly and drop table-valued
function. AllowDataDropStatement enables data-related DROP operations, like drop table and drop schema. You have
to enable AllowDropStatement before using AllowDataDropStatement.
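For illustration, the kinds of statements gated by these flags look like the following sketch; the object names are placeholders:

// Allowed only when AllowDropStatement is enabled (non-data DROP):
DROP ASSEMBLY IF EXISTS [MyAssembly];

// Additionally requires AllowDataDropStatement (data-related DROP):
DROP TABLE IF EXISTS MyDB.dbo.MyTable;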
msbuild USQLBuild.usqlproj /p:USQLSDKPath=packages\Microsoft.Azure.DataLake.USQL.SDK.1.3.180615\build\runtime;USQLTargetType=SyntaxCheck;DataRoot=datarootfolder;/p:EnableDeployment=true
/p:USQLSDKPath=$(Build.SourcesDirectory)/packages/Microsoft.Azure.DataLake.USQL.SDK.1.3.180615/build/runtime /p:USQLTargetType=SyntaxCheck /p:DataRoot=$(Build.SourcesDirectory) /p:EnableDeployment=true
NOTE
Code-behind files for each U-SQL script will be merged as an inline statement to the script build output.
<#
This script can be used to submit U-SQL Jobs with given U-SQL project build output(.usqlpack file).
This will unzip the U-SQL project build output, and submit all scripts one-by-one.
Note: the code behind file for each U-SQL script will be merged into the built U-SQL script in build
output.
Example :
USQLJobSubmission.ps1 -ADLAAccountName "myadlaaccount" -ArtifactsRoot "C:\USQLProject\bin\debug\" -DegreeOfParallelism 2
#>
param(
[Parameter(Mandatory=$true)][string]$ADLAAccountName, # ADLA account name to submit U-SQL jobs
[Parameter(Mandatory=$true)][string]$ArtifactsRoot, # Root folder of U-SQL project build output
[Parameter(Mandatory=$false)][string]$DegreeOfParallelism = 1
)
return $USQLFiles
}
# Submit each usql script and wait for completion before moving ahead.
foreach ($usqlFile in $usqlFiles)
{
$scriptName = "[Release].[$([System.IO.Path]::GetFileNameWithoutExtension($usqlFile.fullname))]"
LogJobInformation $jobToSubmit
Function LogJobInformation($jobInfo)
{
Write-Output "************************************************************************"
Write-Output ([string]::Format("Job Id: {0}", $(DefaultIfNull $jobInfo.JobId)))
Write-Output ([string]::Format("Job Name: {0}", $(DefaultIfNull $jobInfo.Name)))
Write-Output ([string]::Format("Job State: {0}", $(DefaultIfNull $jobInfo.State)))
Write-Output ([string]::Format("Job Started at: {0}", $(DefaultIfNull $jobInfo.StartTime)))
Write-Output ([string]::Format("Job Ended at: {0}", $(DefaultIfNull $jobInfo.EndTime)))
Write-Output ([string]::Format("Job Result: {0}", $(DefaultIfNull $jobInfo.Result)))
Write-Output "************************************************************************"
}
Function DefaultIfNull($item)
{
if ($item -ne $null)
{
return $item
}
return ""
}
Function Main()
{
Write-Output ([string]::Format("ADLA account: {0}", $ADLAAccountName))
Write-Output ([string]::Format("Root folder for usqlpack: {0}", $ArtifactsRoot))
Write-Output ([string]::Format("AU count: {0}", $DegreeOfParallelism))
SubmitAnalyticsJob
Main
Example :
FileUpload.ps1 -ADLSName "myadlsaccount" -ArtifactsRoot "C:\USQLProject\bin\debug\"
#>
param(
[Parameter(Mandatory=$true)][string]$ADLSName, # ADLS account name to upload U-SQL scripts
[Parameter(Mandatory=$true)][string]$ArtifactsRoot, # Root folder of U-SQL project build output
[Parameter(Mandatory=$false)][string]$DestinationFolder = "USQLScriptSource" # Destination folder in ADLS
)
Function UploadResources()
{
Write-Host "************************************************************************"
Write-Host "Uploading files to $ADLSName"
Write-Host "***********************************************************************"
$usqlScripts = GetUsqlFiles
Function GetUsqlFiles()
{
return Get-ChildItem -Path $UnzipOutput -Include *.usql -File -Recurse -ErrorAction SilentlyContinue
}
UploadResources
msbuild DatabaseProject.usqldbproj
/p:USQLSDKPath=packages\Microsoft.Azure.DataLake.USQL.SDK.1.3.180615\build\runtime
The argument USQLSDKPath=<U-SQL Nuget package>\build\runtime refers to the install path of the NuGet package
for the U -SQL language service.
Continuous integration with Azure Pipelines
In addition to the command line, you can use Visual Studio Build or an MSBuild task to build U -SQL database
projects in Azure Pipelines. To set up a build task, make sure to add two tasks in the build pipeline: a NuGet
restore task and an MSBuild task.
1. Add a NuGet restore task to get the solution-referenced NuGet package, which includes
Azure.DataLake.USQL.SDK , so that MSBuild can find the U -SQL language targets. Set Advanced >
Destination directory to $(Build.SourcesDirectory)/packages if you want to use the MSBuild arguments
sample directly in step 2.
2. Set MSBuild arguments in Visual Studio build tools or in an MSBuild task as shown in the following
example. Or you can define variables for these arguments in the Azure Pipelines build pipeline.
/p:USQLSDKPath=$(Build.SourcesDirectory)/packages/Microsoft.Azure.DataLake.USQL.SDK.1.3.180615/build/runtime
NOTE
PowerShell command-line support and Azure Pipelines release task support for U-SQL database deployment are currently
pending.
Take the following steps to set up a database deployment task in Azure Pipelines:
1. Add a PowerShell Script task in a build or release pipeline and execute the following PowerShell script.
This task helps to get the Azure SDK dependencies and PackageDeploymentTool.exe. You can set the -AzureSDK and
-DBDeploymentTool parameters to load the dependencies and the deployment tool into specific folders. Pass the
-AzureSDK path to PackageDeploymentTool.exe as the -AzureSDKPath parameter in step 2.
<#
This script is used for getting dependencies and SDKs for U-SQL database deployment.
PowerShell command line support for deploying U-SQL database package(.usqldbpack file) will come
soon.
Example :
GetUSQLDBDeploymentSDK.ps1 -AzureSDK "AzureSDKFolderPath" -DBDeploymentTool
"DBDeploymentToolFolderPath"
#>
param (
[string]$AzureSDK = "AzureSDK", # Folder to cache Azure SDK dependencies
[string]$DBDeploymentTool = "DBDeploymentTool", # Folder to cache U-SQL database deployment tool
[string]$workingfolder = "" # Folder to execute these command lines
)
if ([string]::IsNullOrEmpty($workingfolder))
{
$scriptpath = $MyInvocation.MyCommand.Path
$workingfolder = Split-Path $scriptpath
}
cd $workingfolder
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.Management.DataLake.Analytics/3.5.1-preview -outf Microsoft.Azure.Management.DataLake.Analytics.3.5.1-preview.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.Management.DataLake.Store/2.4.1-preview -outf Microsoft.Azure.Management.DataLake.Store.2.4.1-preview.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.IdentityModel.Clients.ActiveDirectory/2.28.3 -outf Microsoft.IdentityModel.Clients.ActiveDirectory.2.28.3.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime/2.3.11 -outf Microsoft.Rest.ClientRuntime.2.3.11.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime.Azure/3.3.7 -outf Microsoft.Rest.ClientRuntime.Azure.3.3.7.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Rest.ClientRuntime.Azure.Authentication/2.3.3 -outf Microsoft.Rest.ClientRuntime.Azure.Authentication.2.3.3.zip
iwr https://www.nuget.org/api/v2/package/Newtonsoft.Json/6.0.8 -outf Newtonsoft.Json.6.0.8.zip
iwr https://www.nuget.org/api/v2/package/Microsoft.Azure.DataLake.USQL.SDK/ -outf USQLSDK.zip
2. Add a Command-Line task in a build or release pipeline and fill in the script by calling
PackageDeploymentTool.exe . PackageDeploymentTool.exe is located under the defined
$DBDeploymentTool folder. The sample script is as follows:
Deploy a U -SQL database locally:
Use interactive authentication mode to deploy a U -SQL database to an Azure Data Lake Analytics
account:
PackageDeploymentTool.exe deploycluster -Package <package path> -Database <database name> -
Account <account name> -ResourceGroup <resource group name> -SubscriptionId <subscript id> -
Tenant <tenant name> -AzureSDKPath <azure sdk path> -Interactive
Use secret authentication to deploy a U-SQL database to an Azure Data Lake Analytics account:
Use certFile authentication to deploy a U -SQL database to an Azure Data Lake Analytics account:
SecreteFile: The file that saves the secret or password for non-interactive authentication. Make sure to keep it
readable only by the current user. Default value: null. Required for non-interactive authentication; otherwise, use Secrete.
Next steps
How to test your Azure Data Lake Analytics code.
Run U -SQL script on your local machine.
Use U -SQL database project to develop U -SQL database.
Best practices for managing U-SQL assemblies in a
CI/CD pipeline
3/11/2019 • 3 minutes to read • Edit Online
In this article, you learn how to manage U -SQL assembly source code with the newly introduced U -SQL database
project. You also learn how to set up a continuous integration and deployment (CI/CD ) pipeline for assembly
registration by using Azure DevOps.
4. Add a reference to the C# class library project for the U -SQL database project.
5. Create an assembly script in the U -SQL database project by right-clicking the project and selecting Add
New Item.
6. Open the assembly script in the assembly design view. Select the referenced assembly from the Create
assembly from reference drop-down menu.
7. Add Managed Dependencies and Additional Files, if there are any. When you add additional files, the
tool uses the relative path to make sure it can find the assemblies on your local machine and on the build
machine later.
@_DeployTempDirectory in the editor window at the bottom is a predefined variable that points the tool to the
build output folder. Under the build output folder, every assembly has a subfolder named with the assembly name.
All DLLs and additional files are in that subfolder.
In Azure DevOps, you can use a command-line task and this SDK to set up an automation pipeline for the U -SQL
database refresh. Learn more about the SDK and how to set up a CI/CD pipeline for U -SQL database deployment.
Next steps
Set up a CI/CD pipeline for Azure Data Lake Analytics
Test your Azure Data Lake Analytics code
Run U -SQL script on your local machine
Test your Azure Data Lake Analytics code
11/1/2018 • 4 minutes to read • Edit Online
Azure Data Lake provides the U -SQL language, which combines declarative SQL with imperative C# to process
data at any scale. In this document, you learn how to create test cases for U -SQL and extended C# UDO (user-
defined operator) code.
Test C# UDOs
Create test cases for C# UDOs
You can use a C# unit test framework to test your C# UDOs (user-defined operators). When testing UDOs, you
need to prepare corresponding IRowset objects as inputs.
There are two ways to create an IRowset object:
Load data from a file to create IRowset:
Next steps
How to set up CI/CD pipeline for Azure Data Lake Analytics
Run U -SQL script on your local machine
Use U -SQL database project to develop U -SQL database
Run and test U-SQL with Azure Data Lake U-SQL
SDK
3/15/2019 • 11 minutes to read • Edit Online
When developing a U-SQL script, it is common to run and test the script locally before submitting it to the cloud. Azure
Data Lake provides a NuGet package called Azure Data Lake U-SQL SDK for this scenario, through which you can
easily scale out U-SQL runs and tests. It is also possible to integrate this U-SQL testing with a CI (continuous integration)
system to automate compilation and testing.
If you care about how to manually run and debug a U-SQL script locally with GUI tooling, you can use Azure
Data Lake Tools for Visual Studio for that. You can learn more here.
PATH IN U-SQL SCRIPT          RESOLVED LOCAL PATH
/abc/def/input.csv            C:\LocalRunDataRoot\abc\def\input.csv
abc/def/input.csv             C:\LocalRunDataRoot\abc\def\input.csv
D:/abc/def/input.csv          D:\abc\def\input.csv
Working directory
When running a U-SQL script locally, a working directory is created during compilation under the current running
directory. In addition to the compilation outputs, the runtime files needed for local execution are shadow copied
to this working directory. The working directory root folder is called "ScopeWorkDir", and the files under the
working directory are as follows:
Run LocalRunHelper.exe without arguments or with the help switch to show the help information:
> LocalRunHelper.exe help
Define a new environment variable called SCOPE_CPP_SDK to point to this directory. Or copy the folder
to another location and point SCOPE_CPP_SDK to it.
In addition to setting the environment variable, you can specify the -CppSDK argument when you're using
the command line. This argument overwrites your default CppSDK environment variable.
Set the LOCALRUN_DATAROOT environment variable.
Define a new environment variable called LOCALRUN_DATAROOT that points to the data root.
In addition to setting the environment variable, you can specify the -DataRoot argument with the data-root
path when you're using a command line. This argument overwrites your default data-root environment
variable. You need to add this argument to every command line you're running so that you can overwrite
the default data-root environment variable for all operations.
SDK command line usage samples
Compile and run
The run command is used to compile the script and then execute compiled results. Its command-line arguments
are a combination of those from compile and execute.
Here's an example:
Besides combining compile and execute, you can compile and execute the compiled executables separately.
Compile a U-SQL script
The compile command is used to compile a U -SQL script to executables.
ARGUMENT DESCRIPTION
-CodeBehind [default value 'False'] The script has .cs code behind
-DataRoot [default value 'DataRoot environment variable'] DataRoot for local run, default to 'LOCALRUN_DATAROOT'
environment variable
-References [default value ''] List of paths to extra reference assemblies or data files of code
behind, separated by ';'
-UseDatabase [default value 'master'] Database to use for code behind temporary assembly
registration
-WorkDir [default value 'Current Directory'] Directory for compiler usage and outputs
-ScopeCEPTempPath [default value 'temp'] Temp path to use for streaming data
Compile a U -SQL script and set the data-root folder. Note that this will overwrite the set environment variable.
Compile a U -SQL script and set a working directory, reference assembly, and database:
The U-SQL SDK only supports the x64 environment, so make sure to set the build platform target to x64. You can set that
through Project Property > Build > Platform target.
Make sure to set your test environment to x64. In Visual Studio, you can set it through Test > Test Settings
> Default Processor Architecture > x64.
Make sure to copy all dependency files under NugetPackage\build\runtime\ to the project working directory,
which is usually under ProjectFolder\bin\x64\Debug.
Step 2: Create U -SQL script test case
Below is the sample code for a U-SQL script test. For testing, you need to prepare scripts, input files, and expected
output files.
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.IO;
using System.Text;
using System.Security.Cryptography;
using Microsoft.Analytics.LocalRun;
namespace UnitTestProject1
{
[TestClass]
public class USQLUnitTest
{
[TestMethod]
public void TestUSQLScript()
{
//Specify the local run message output path
StreamWriter MessageOutput = new StreamWriter("../../../log.txt");
//Script output
string Result = Path.Combine(localrun.DataRoot, "Output/result.csv");
Test.Helpers.FileAssert.AreEqual(Result, ExpectedResult);
namespace Test.Helpers
{
public static class FileAssert
{
static string GetFileHash(string filename)
{
Assert.IsTrue(File.Exists(filename));
Assert.AreEqual(hash1, hash2);
}
}
}
Properties
PROPERTY TYPE DESCRIPTION
Method
METHOD DESCRIPTION RETURN PARAMETER
public bool IsValidRuntimeDir(string path): Checks whether the given path is a valid runtime directory. Returns true if the path is valid. Parameter: the path of the runtime directory.
Next steps
To learn U -SQL, see Get started with Azure Data Lake Analytics U -SQL language.
To log diagnostics information, see Accessing diagnostics logs for Azure Data Lake Analytics.
To see a more complex query, see Analyze website logs using Azure Data Lake Analytics.
To view job details, see Use Job Browser and Job View for Azure Data Lake Analytics jobs.
To use the vertex execution view, see Use the Vertex Execution View in Data Lake Tools for Visual Studio.