This series of posts introduces the uninitiated SQL Server professional to the data warehouse in Azure Synapse Analytics. This data warehouse was formerly known as Azure SQL Data Warehouse, distinct from Azure SQL Database. Further back, Azure SQL Data Warehouse was implemented as the PDW (Parallel Data Warehouse) appliance. This lineage, while not immediately apparent, can be discerned in the names of the DMVs, most of which start with ‘dm_pdw_’. We will explore architectural, design, and programmatic considerations in this series of posts. This post is the first of three and focuses on the architectural differences.
Much of the work this content is based on was done before Microsoft’s rebranding of Azure SQL Data Warehouse as Azure Synapse Analytics. To be clear, this blog is focused on the SQL pool aspect of Azure Synapse Analytics. Azure Synapse Analytics Serverless (On Demand) is outside the scope of this discussion.
This first post reviews the differences in the architectures of Azure SQL Database and the Azure Synapse Analytics data warehouse (DW-SQL Pools).
Azure SQL Database (and traditional versions of SQL Server) runs on computers based on the SMP model. Ultimately, this means that SQL Server is installed on a single computer/server that may (and probably does) have multiple CPUs and cores and therefore permits parallel processing. However, there is only one server doing all the work to support requests.
The data warehouse provided in Azure Synapse Analytics (and its antecedents) is built on a Massively Parallel Processing architecture. In this case, multiple computers/servers (referred to as nodes), each with SQL Server installed, are deployed. Each instance has its own processors, memory, and dedicated storage. One node in this scenario is the control node. The purpose of the control node is to distribute requests to worker nodes. The worker nodes do all work that can be done locally and report the results to the control node. The control node then performs any additional work on the combined results from all worker nodes and reports the results to the requestor.
Azure SQL Database is a relational database-as-a service using the Microsoft SQL Server Engine. SQL Database is a high-performance, reliable, and secure database you can use to build data-driven applications and websites in the programming language of your choice, without needing to manage infrastructure.
Azure Synapse Analytics Data Warehouse is a massively parallel processing (MPP) cloud-based, scale-out, relational database capable of processing massive volumes of data.
Size of Data
The size of a database in Synapse Analytics DW is virtually unlimited; the only practical limit is your wallet, and blob storage is inexpensive. Further, if clustered columnstore indexes are leveraged, compression of the data to roughly 10% of its original size permits even larger databases to be stored.
Because each shard is connected to its own blob storage, the database size limits imposed on a single Azure SQL Database are mitigated.
In a well-designed Synapse Analytics DW implementation, every shard will be independently operating on its own set of data in parallel with the other shards.
SMP vs MPP
As we can see in the above illustrations, there are substantial differences in how work will be accomplished in each of these architectures. Central to the MPP architecture is the shard. What is a shard? For our purposes here, it is one of the SQL compute nodes and its associated blob storage, referred to above as a worker node. Shards are horizontal partitions of data. It is possible that some data exists on all shards, but the typical usage is for a shard to own a subset of the overall data. Every shard is held on one database server. Each shard acts as the single source for a subset of data, determined by the distribution chosen.
There are different design considerations as well as limitations in T-SQL in MPP compared to SMP, all of which we will explore further. The focus will be on what needs to be considered in the Azure Synapse Analytics data warehouse when one is accustomed to working with any of the other flavors of Microsoft SQL Server.
Declarative Referential Integrity
Referential integrity constraints cannot be applied in the Synapse Analytics data warehouse because there are:
- No primary keys
- No foreign keys
- No unique constraints
There are probably several reasons for this, but they likely center on performance considerations, particularly if enforcement requires the database to visit other shards to determine if there is a violation. If uniqueness or existence must be enforced, then it must be done in code.
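Since uniqueness cannot be declared, it must be verified in the loading code. A minimal sketch, assuming a hypothetical dbo.DimCustomer table with a CustomerKey column that is expected to be unique:

```sql
-- Hedged sketch: detect duplicate keys in code, since UNIQUE constraints
-- are not available. Table and column names are hypothetical.
IF EXISTS (
    SELECT CustomerKey
    FROM dbo.DimCustomer
    GROUP BY CustomerKey
    HAVING COUNT(*) > 1
)
BEGIN
    RAISERROR('Duplicate CustomerKey values detected.', 16, 1);
END;
```

A check like this is typically run after each load, before the data is exposed to downstream consumers.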
Additionally, it is possible that the engineers of the original PDW may have decided that the data coming to the data warehouse should have already been cleansed, deduped, etc. A cost/benefit analysis of applying referential integrity may have been decided in favor of not implementing the constraints.
The distribution of a table in an MPP architecture defines how data will be distributed among the shards. While this can be considered a partitioning of the data, it is not the same as the partitioning of data that SQL Server supports. Synapse Analytics also supports that method of partitioning. Three methods are available for defining the distribution of data in a shard of a table:
With a ROUND ROBIN distribution, as the name suggests, the control node distributes rows sequentially through the shards. By definition, data is evenly distributed among the worker nodes.
For a distribution that is set to REPLICATE, a copy of the data is retained on every shard. This is generally most applicable to reference tables where the data is ‘small’. Movement of data to the shards to satisfy join lookups is avoided unless the looked-up data is modified. Data in a replicated table is maintained on the control node and copied as needed to the individual shards.
In the HASH distribution, the data is organized among the shards based on a hashed value of the column passed as a parameter to the hash function. Only one column may be used in the HASH() function. There are many considerations when choosing the column to hash; the choice should allow as many operations as possible to be pushed to the individual shards so that they can be performed in parallel. For more information on making the choice, feel free to reach out to the author.
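The three distribution options are declared in the WITH clause of CREATE TABLE. A hedged sketch, with hypothetical table and column names:

```sql
-- HASH: large fact table distributed on a join key.
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );

-- ROUND_ROBIN: staging table where even spread matters more than locality.
CREATE TABLE dbo.StageSales
(
    RawLine VARCHAR(8000)
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- REPLICATE: small reference table copied to every shard.
CREATE TABLE dbo.DimDate
(
    DateKey INT  NOT NULL,
    TheDate DATE NOT NULL
)
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (DateKey) );
```

Note that the index choice (columnstore, heap, or rowstore) is declared in the same WITH clause as the distribution.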
Indexing for a sharded database generally follows the rules one would apply to a traditional database, with a few exceptions. A table should be expected to contain at least 1 million records per shard for a clustered columnstore index to be effective; for most Synapse Analytics data warehouses, with 60 shards, that means 60 million records. Otherwise, a heap should be chosen. Ordinary clustered and nonclustered indexes may be defined. Unique indexes are not permitted.
One can easily convert between a clustered columnstore index and a heap using the normal CREATE and DROP statements. If one desires a clustered columnstore index that is named according to a naming convention, the table must be created as a heap and then converted to a clustered columnstore index.
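The heap-to-named-columnstore conversion described above can be sketched as follows (table and index names are hypothetical):

```sql
-- Convert an existing heap to a clustered columnstore index with a
-- convention-compliant name.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;

-- Converting back to a heap is just a matter of dropping the index.
DROP INDEX cci_FactSales ON dbo.FactSales;
```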
The distribution of data in a table that is hashed on a column can be checked as follows:
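A minimal check uses DBCC PDW_SHOWSPACEUSED, which reports row counts and space per distribution (the table name here is hypothetical):

```sql
-- One result row per distribution: row count, reserved, data, and index space.
-- An even spread of row counts indicates a good hash column choice.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
```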
It may be determined that the distribution of records is inadequate. Alternatives can be quickly compared by generating additional versions of the table in question using CTAS (CREATE TABLE AS SELECT) to create tables with differing distributions.
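A hedged sketch of that comparison, assuming the hypothetical dbo.FactSales table: CTAS builds a full copy under a candidate distribution, after which the per-distribution row counts of the two tables can be compared.

```sql
-- Rebuild the table with a different hash column via CTAS.
CREATE TABLE dbo.FactSales_AltDist
WITH ( DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX )
AS
SELECT * FROM dbo.FactSales;

-- Compare the spread of rows against the original distribution.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales_AltDist');
```

Because CTAS runs fully in parallel across the shards, generating these alternate copies is typically fast even for large tables.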
Microsoft’s documentation also provides information on determining skew within the shards. A view definition serves as the foundation for determining skew if one chooses to gather more detailed information.
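A simplified, hedged version of that approach is sketched below; it maps the logical table to its per-distribution physical tables and sums row counts per distribution (the filtered table name is hypothetical, and Microsoft's full view adds space metrics as well):

```sql
-- Row counts per distribution for one table; a large spread between the
-- highest and lowest counts indicates skew.
SELECT
    t.name               AS table_name,
    nps.distribution_id,
    SUM(nps.[row_count]) AS [row_count]
FROM sys.tables t
JOIN sys.pdw_table_mappings tm
    ON t.object_id = tm.object_id
JOIN sys.pdw_nodes_tables nt
    ON tm.physical_name = nt.name
JOIN sys.dm_pdw_nodes_db_partition_stats nps
    ON nt.object_id = nps.object_id
   AND nt.pdw_node_id = nps.pdw_node_id
   AND nt.distribution_id = nps.distribution_id
WHERE t.name = 'FactSales'   -- hypothetical table name
GROUP BY t.name, nps.distribution_id
ORDER BY [row_count] DESC;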
A Word on Identities
Defining an identity on a column is permitted, but several facts about identities in Synapse Analytics data warehouses need to be understood.
First, it appears that every shard has its own set of numbers available to it. The seed on each shard is initialized to a unique number. Presuming 60 shards and a starting value of 1, the seeds on the shards will be 1 through 60. Every subsequent value adds the number of shards to the last value created on that shard; shard 1 might have the values 1, 61, 121, etc. available to it. Consequently, with a hashed distribution, there is no guarantee that the values generated are in sequence, and there will usually be holes in the sequence. Under the circumstances stated above, the first insert into a table with an identity may generate a row with a number between 1 and 60, say 47. The second insert may then receive a value between 1 and 60 (but not 47) if the row lands on a different shard, or 107 if it lands on the same shard as the first.
Second, an identity column cannot be used as the hashing distribution column. The hash value must be known to the control node before it directs the row to a shard, but an identity value is generated on the shard, so the control node cannot know where to direct the row.
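A hedged sketch of a valid identity declaration, with hypothetical names; note that the identity column is not the distribution column:

```sql
-- The IDENTITY column supplies a surrogate key; the table must be
-- distributed on some other column (here CustomerKey).
CREATE TABLE dbo.DimCustomer
(
    CustomerSk  INT IDENTITY(1,1) NOT NULL,
    CustomerKey INT               NOT NULL
)
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );
```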
Azure Synapse Analytics DW can provide scale and performance when querying massive quantities of data, if done right. Because the architectures diverge in fundamental ways, the rules are different from those one would apply when designing for Azure SQL Database. Stay tuned for two additional posts that discuss design and T-SQL differences between Azure Synapse Analytics DW and Azure SQL Database implementations.