Welcome back! This post is part of a series walking through how to set up a basic copy data pipeline in Azure Data Factory. So far, we’ve completed our first two steps: creating a SQL Database Linked Service and creating an Azure Data Lake Gen2 Storage Account Linked Service. In those posts, we saw that a linked service is simply a connection to a data source and/or a destination. Now we are going to explore the concept of a dataset in Azure Data Factory and create one for each of our linked services.
What is a Dataset, and Where Does it Fit in?
A dataset is a virtual representation of a data item that is stored in a Linked Service. Datasets are a crucial pillar of Azure Data Factory: without them, we would not be able to save data to, or extract data from, the linked services we defined. This post explores how to configure a dataset that points to a specific table in a SQL Database, and how to read CSV and Parquet files from an Azure Data Lake Gen2 Storage Account.
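Behind the Azure Data Factory UI, every dataset is stored as a JSON definition. As a rough sketch of what that looks like for a SQL table (the dataset, linked service, schema, and table names below are placeholders for illustration, not values from this series), an Azure SQL Database dataset is defined along these lines:

```json
{
    "name": "CustomersTableDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlDatabaseLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",
            "table": "Customers"
        },
        "schema": []
    }
}
```

The "linkedServiceName" reference is what ties the dataset back to the connection we created earlier, while "typeProperties" narrows it down to a single schema and table.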
How to create a Dataset from an Azure Data Lake Gen2 Storage Account Linked Service
1) Select the subfolder you would like to create your dataset under, hover over the ellipsis, and select ‘Create new Dataset.’
2) You should see a list of all the dataset types Azure Data Factory supports. Type ‘Data Lake’ into the search bar. Two options will come up: Azure Data Lake Storage Gen1 and Azure Data Lake Storage Gen2. Select the Gen2 option.
3) Once Gen2 is selected, choose the file format you want to save your data in; here we will use the Parquet file format. It is highly recommended to use this format to store data whenever possible: Parquet’s run-length encoding and indexing reduce the size of your data, making it faster to read.
4) After selecting Parquet, we are asked to pick a linked service. Select the linked service we created in one of my previous posts.
5) Once the linked service is selected, we are prompted to enter a file path for our dataset’s location. There is a small browse icon to the right of the input boxes; if you click it, you can navigate through your storage account and select the destination where your file should be saved.
6) Once our directory location is selected, we can click ‘Create,’ give our dataset a friendly name, and save our changes. We have now configured a dataset that writes to an Azure Data Lake Gen2 Storage Account; a sketch of the JSON this produces follows below.
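For reference, what the UI builds behind the scenes is a JSON definition along these lines. This is only a sketch: the dataset name, linked service name, file system (container), and folder path are placeholder values, and the exact definition for your own dataset can be viewed in the Data Factory authoring UI.

```json
{
    "name": "MyParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorageGen2LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "datalake-container",
                "folderPath": "raw/sales"
            },
            "compressionCodec": "snappy"
        },
        "schema": []
    }
}
```

The "location" block captures the file path chosen in step 5, and "compressionCodec" controls how the Parquet files are compressed (snappy is a common default).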
If you would like to explore options in partnering with Tallan to help build out your business’s cloud data analytics platform, please reach out to me at Conner.Wulf@tallan.com or connect on LinkedIn.
Click here to view all of Tallan’s latest offerings, and find what’s right for your organization.