PolyBase and Big Data Clusters

Organizations generate data faster than ever, and they need to turn that data into business insight and better decisions. Doing so requires data management solutions that integrate with diverse data sources and scale as data volumes grow.

Microsoft's SQL Server is a popular choice for enterprise data management due to its reliability, security, and scalability. SQL Server has evolved over the years to support a variety of workloads, including traditional relational databases, business intelligence, and big data analytics. In this blog post, we'll explore how SQL Server's PolyBase and Big Data Clusters features enable seamless integration with diverse data sources, including Hadoop and Apache Spark.

1. Understanding PolyBase

PolyBase is a feature introduced in SQL Server 2016 that lets SQL Server query external data sources. The initial release supported Hadoop and Azure Blob Storage; SQL Server 2019 added connectors for other SQL Server instances as well as Oracle, Teradata, and MongoDB. With PolyBase, SQL Server can query data from these sources as if they were native tables within SQL Server, without the need for additional ETL tools or complex data transformation processes.

PolyBase exposes external data through ordinary T-SQL queries, and it supports both relational sources and file-based data. For example, PolyBase can query files stored in the Hadoop Distributed File System (HDFS), including files written by Apache Hive, as well as data in Azure Blob Storage and, depending on the SQL Server version, Azure Data Lake Storage. For file-based sources, PolyBase reads delimited text as well as the RCFile, ORC, and Parquet formats.
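To make the "external data as a native table" idea concrete, here is a minimal sketch in Python that renders the three T-SQL statements PolyBase typically needs for a Hadoop source, plus a query against the result. The object names (`HadoopSource`, `ParquetFormat`, `dbo.WebClicks`), the columns, the NameNode address, and the folder path are all hypothetical, and the `TYPE = HADOOP` syntax applies to SQL Server 2016 through 2019. Treat it as an illustration under those assumptions rather than a deployment script.

```python
"""Render T-SQL for a PolyBase external table over Parquet files in HDFS.

All object names, columns, and paths are hypothetical examples.
"""
from typing import List


def quote_literal(value: str) -> str:
    """Return a T-SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def polybase_setup(hdfs_uri: str, folder: str) -> List[str]:
    """Build the statements that expose an HDFS folder as dbo.WebClicks."""
    return [
        # 1. Where the data lives (Hadoop syntax, SQL Server 2016-2019).
        "CREATE EXTERNAL DATA SOURCE HadoopSource\n"
        f"WITH (TYPE = HADOOP, LOCATION = {quote_literal(hdfs_uri)});",
        # 2. How the files are encoded.
        "CREATE EXTERNAL FILE FORMAT ParquetFormat\n"
        "WITH (FORMAT_TYPE = PARQUET);",
        # 3. The table definition; SQL Server stores only the metadata.
        "CREATE EXTERNAL TABLE dbo.WebClicks (\n"
        "    click_id   BIGINT,\n"
        "    url        NVARCHAR(400),\n"
        "    clicked_at DATETIME2\n"
        ")\n"
        f"WITH (LOCATION = {quote_literal(folder)},\n"
        "      DATA_SOURCE = HadoopSource,\n"
        "      FILE_FORMAT = ParquetFormat);",
    ]


# Once the external table exists, it is queried like any other table.
TOP_URLS_QUERY = (
    "SELECT TOP (10) url, COUNT(*) AS clicks\n"
    "FROM dbo.WebClicks\n"
    "GROUP BY url\n"
    "ORDER BY clicks DESC;"
)

if __name__ == "__main__":
    for statement in polybase_setup("hdfs://namenode.example.com:8020", "/data/clicks/"):
        print(statement, end="\n\n")
    print(TOP_URLS_QUERY)
```

Running the script prints the statements in the order they would be executed. The final `SELECT` contains nothing PolyBase-specific, which is the point: client code does not need to know that the rows live in HDFS.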

1.1. PolyBase Architecture

PolyBase is built on top of SQL Server's distributed query processing engine, which enables the execution of distributed queries across multiple SQL Server instances and external data sources. PolyBase architecture consists of two main components: the PolyBase Engine and the PolyBase Data Movement Service.

The PolyBase Engine is responsible for parsing T-SQL queries that involve external data sources and generating execution plans that can be executed across distributed nodes. The PolyBase Engine also includes a cost-based optimizer that selects the most efficient query execution plan based on the available data statistics and the query predicates.

The PolyBase Data Movement Service is responsible for transferring data between SQL Server and external data sources. The Data Movement Service uses a distributed architecture to parallelize data transfer across multiple nodes, which improves query performance and scalability. To reduce data transfer overhead, PolyBase also pushes as much of the query processing as possible down to the external data source; whether a given operation is pushed down is a cost-based decision.
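The effect of pushdown and parallel transfer is easier to see with numbers. The following is a toy model in plain Python, not PolyBase code: the partitions, the row layout, and the function names are invented for illustration. Each partition is read by its own worker (standing in for parallel data movement), and the filter is applied either at the source (pushdown) or only after the rows have been transferred.

```python
"""Toy model of predicate pushdown and parallel data movement.

This does not use PolyBase; it only counts how many rows would have to
cross the network under each strategy. All data here is made up.
"""
from concurrent.futures import ThreadPoolExecutor

# Hypothetical external data set: four partitions of (order_id, amount) rows.
PARTITIONS = [
    [(p * 1000 + i, float(i % 500)) for i in range(1000)]
    for p in range(4)
]


def read_partition(rows, predicate=None):
    """Simulate one reader and return the rows it ships to SQL Server."""
    if predicate is None:
        return list(rows)  # no pushdown: every row is shipped
    return [row for row in rows if predicate(row)]  # pushdown: filter at the source


def query(predicate, pushdown):
    """Return (matching_rows, rows_transferred) for the chosen strategy."""
    source_filter = predicate if pushdown else None
    # One worker per partition stands in for parallel data movement.
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        chunks = list(pool.map(lambda part: read_partition(part, source_filter), PARTITIONS))
    transferred = sum(len(chunk) for chunk in chunks)
    # Anything not filtered remotely is filtered locally after the transfer.
    result = [row for chunk in chunks for row in chunk if predicate(row)]
    return result, transferred


def is_large(row):
    """Example predicate: keep orders with an amount of at least 490."""
    return row[1] >= 490.0


if __name__ == "__main__":
    for pushdown in (False, True):
        rows, moved = query(is_large, pushdown)
        print(f"pushdown={pushdown}: {len(rows)} matching rows, {moved} rows transferred")
```

Both strategies return the same 80 rows, but without pushdown all 4000 rows are transferred, while with pushdown only the 80 matching rows are. In T-SQL, the query hints `OPTION (FORCE EXTERNALPUSHDOWN)` and `OPTION (DISABLE EXTERNALPUSHDOWN)` let you override the optimizer's choice for Hadoop sources.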

1.2. PolyBase Use Cases

PolyBase can be used for a variety of scenarios, including:

Big Data Analytics: PolyBase lets SQL Server query data stored in Hadoop clusters, so enterprises can perform big data analytics without additional ETL tools or a separate data warehouse.
Hybrid Data Integration: PolyBase enables seamless integration of on-premises and cloud-based data sources, including Azure Blob Storage, Azure Data Lake Storage, and SQL Server instances.
Data Archival: PolyBase can be used to archive data to low-cost storage such as Azure Blob Storage or Hadoop, while still enabling easy access to the data using standard T-SQL queries.
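The archival scenario can be sketched as follows. This Python snippet renders the T-SQL batches for moving old rows from a local table into an external table and then exposing both through one view. It assumes that a hypothetical external table `dbo.OrdersArchive` already exists over Azure Blob Storage or Hadoop with the same columns as a hypothetical local table `dbo.Orders`; the `allow polybase export` setting is the server option that permits writing to such tables. Check the row counts before running the `DELETE` in a real system.

```python
"""Render T-SQL that archives old rows to a PolyBase external table.

dbo.Orders, dbo.OrdersArchive, dbo.AllOrders and their columns are
hypothetical; dbo.OrdersArchive is assumed to be an existing external table.
"""
from datetime import date
from typing import List


def archive_statements(cutoff: date) -> List[str]:
    """Return the batches that move rows older than `cutoff` out of dbo.Orders."""
    cutoff_literal = "'" + cutoff.isoformat() + "'"
    return [
        # Writing to an external table must be enabled once per instance.
        "EXEC sp_configure 'allow polybase export', 1;\nRECONFIGURE;",
        # Copy cold rows to external storage through the external table.
        "INSERT INTO dbo.OrdersArchive\n"
        "SELECT order_id, customer_id, order_date, amount\n"
        "FROM dbo.Orders\n"
        f"WHERE order_date < {cutoff_literal};",
        # Remove the archived rows from the local table.
        f"DELETE FROM dbo.Orders\nWHERE order_date < {cutoff_literal};",
        # One view keeps hot and archived rows queryable together.
        "CREATE VIEW dbo.AllOrders AS\n"
        "SELECT order_id, customer_id, order_date, amount FROM dbo.Orders\n"
        "UNION ALL\n"
        "SELECT order_id, customer_id, order_date, amount FROM dbo.OrdersArchive;",
    ]


if __name__ == "__main__":
    for batch in archive_statements(date(2020, 1, 1)):
        print(batch, end="\n\n")
```

Reports that previously read `dbo.Orders` can switch to the `dbo.AllOrders` view and keep using standard T-SQL, while the older rows sit on cheaper storage.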

2. Understanding Big Data Clusters

SQL Server 2019 introduced Big Data Clusters, a feature for deploying scalable, high-performance data clusters that run big data workloads alongside traditional SQL Server workloads. Big Data Clusters build on PolyBase to integrate with diverse data sources, including Hadoop and Apache Spark.

2.1. Big Data Clusters Architecture

Big Data Clusters architecture consists of several components:

SQL Server Master Instance: The SQL Server Master Instance is the primary entry point for client connections and management operations. The Master Instance runs on a Kubernetes cluster and includes the PolyBase [...]

Storage Pool and Data Pool: The Storage Pool is the distributed storage layer that stores and manages data files across the cluster. It is built on top of HDFS, which provides a fault-tolerant and scalable storage layer, and the HDFS NameNode tracks the location and status of data files across the cluster. The Data Pool is a separate component: it consists of SQL Server instances that persist data ingested from T-SQL queries or Spark jobs.

Compute Pool: The Compute Pool is a distributed compute layer made up of PolyBase-enabled SQL Server instances. The Master Instance uses these instances to scale out queries over external and distributed data. Apache Spark workloads, with their drivers and executors, also run in the cluster, but they run alongside HDFS on the storage nodes rather than in the Compute Pool. This is how SQL Server workloads and Spark workloads share one cluster.

Kubernetes: Kubernetes is an open-source container orchestration platform that manages the deployment and scaling of containers across the cluster. Big Data Clusters run on top of Kubernetes, which provides a scalable and resilient platform for running distributed workloads.
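From the Master Instance, the cluster's pools are addressed through the same external data source mechanism that PolyBase uses. The sketch below renders that T-SQL from Python. The two location URIs follow the pattern used in Microsoft's SQL Server 2019 tutorials and should be verified against your own cluster; the table `dbo.ClickSummary` and its columns are hypothetical.

```python
"""Render T-SQL that addresses Big Data Cluster pools from the master instance.

The location URIs follow Microsoft's SQL Server 2019 tutorials (verify them
for your cluster); dbo.ClickSummary is a hypothetical example table.
"""

POOL_LOCATIONS = {
    "SqlDataPool": "sqldatapool://controller-svc/default",   # SQL Server data pool
    "SqlStoragePool": "sqlhdfs://controller-svc/default",    # HDFS-backed storage pool
}


def pool_data_source(name: str) -> str:
    """Return an idempotent CREATE EXTERNAL DATA SOURCE batch for one pool."""
    location = POOL_LOCATIONS[name]
    return (
        f"IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = '{name}')\n"
        f"    CREATE EXTERNAL DATA SOURCE {name}\n"
        f"    WITH (LOCATION = '{location}');"
    )


# Hypothetical table whose rows are spread round-robin across data pool instances.
DATA_POOL_TABLE = (
    "CREATE EXTERNAL TABLE dbo.ClickSummary (\n"
    "    url    NVARCHAR(400),\n"
    "    clicks BIGINT\n"
    ")\n"
    "WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);"
)

if __name__ == "__main__":
    for pool_name in POOL_LOCATIONS:
        print(pool_data_source(pool_name), end="\n\n")
    print(DATA_POOL_TABLE)
```

The `IF NOT EXISTS` guard makes the batch safe to rerun, which is convenient when the same setup script is applied to several databases in the cluster.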

2.2. Big Data Clusters Use Cases

Big Data Clusters can be used for a variety of scenarios, including:

Big Data Analytics: Big Data Clusters enable the integration of big data workloads with traditional SQL Server workloads, enabling enterprises to perform big data analytics using standard T-SQL queries. Big Data Clusters also provide a scalable and high-performance platform for running Apache Spark workloads.

Hybrid Data Integration: Big Data Clusters enable seamless integration of on-premises and cloud-based data sources, including Azure Blob Storage, Azure Data Lake Storage, and SQL Server instances.

Data Science: Big Data Clusters provide a scalable and high-performance platform for running data science workloads using popular tools such as Jupyter Notebooks and RStudio.

Conclusion

PolyBase and Big Data Clusters are powerful features in SQL Server that enable seamless integration with diverse data sources and provide a scalable and high-performance platform for running big data workloads. With PolyBase, SQL Server can query external data sources as if they were native tables, enabling enterprises to perform big data analytics without the need for additional ETL tools or complex data transformation processes. With Big Data Clusters, SQL Server can run big data workloads alongside traditional SQL Server workloads, providing a scalable and high-performance platform for running distributed data processing jobs.

Together, these features help enterprises manage and analyze big data effectively and turn it into better business insights and decisions.