Last week, Microsoft announced the availability of the latest cumulative update (CU5) for SQL Server 2019, which focused on expanding the capabilities offered through Big Data Clusters. Alongside the other changes it delivered, Microsoft revealed that the Apache Spark Connector for Azure SQL and SQL Server had been open-sourced under the Apache License 2.0. The connector is now available on GitHub as a V1 release.
Built on the Spark DataSourceV1 API and the SQL Server Bulk API, the Apache Spark Connector enables the use of transactional data in big data analytics. It lets Spark jobs use both on-premises and cloud-hosted SQL databases as an input data source or as an output data sink. Depending on the scenario, it can also perform up to 15 times faster than the default JDBC connector.
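To illustrate what using a SQL database as a source or sink looks like in practice, here is a minimal PySpark sketch. It assumes the connector JAR is already on the Spark classpath and that the data source is registered under the format name `com.microsoft.sqlserver.jdbc.spark`, as described in the project's README. The helper names (`connector_options`, `read_table`, `write_table`) and all connection values are hypothetical and exist only for this example.

```python
# Format name under which the connector registers itself, per the project README.
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def connector_options(host, database, table, user, password, port=1433):
    """Build the options dict passed to the connector (hypothetical helper)."""
    # Standard SQL Server JDBC URL; host, port and database are placeholders.
    url = f"jdbc:sqlserver://{host}:{port};databaseName={database};"
    return {"url": url, "dbtable": table, "user": user, "password": password}


def read_table(spark, options):
    """Load a SQL table as a DataFrame, using the database as an input source."""
    # `spark` is expected to be an existing pyspark.sql.SparkSession.
    return spark.read.format(CONNECTOR_FORMAT).options(**options).load()


def write_table(df, options, mode="append"):
    """Write a DataFrame to a SQL table, using the database as an output sink."""
    # `mode` is a standard Spark save mode such as "append" or "overwrite".
    df.write.format(CONNECTOR_FORMAT).mode(mode).options(**options).save()
```

In a Spark job, one would typically build the options once and reuse them, for example `opts = connector_options("localhost", "sales", "dbo.orders", "spark_user", "secret")`, followed by `df = read_table(spark, opts)` and later `write_table(df, opts, mode="overwrite")`. In a real deployment, credentials would come from a secret store rather than literals.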
The connector offers a variety of other features as well, the most notable of which are:
- Support for all Spark bindings (Scala, Python, R).
- Basic authentication and Active Directory (AD) keytab support.
- Reordered DataFrame write support.
- Reliable connector support for SQL Server single instance.
Further updates to improve the connector are already in the pipeline. You can check out the project and all its associated files on GitHub.