Last week, Microsoft unveiled the first release candidate refresh for SQL Server 2019, with Big Data Clusters being the primary focus of the announcement. This capability allows for the deployment of scalable clusters of SQL Server, Apache Spark, and HDFS containers running on Kubernetes.
Now, the tech giant has brought PySpark development and query submission support for SQL Server 2019 Big Data Clusters to Visual Studio Code. For those unaware, PySpark is the Python API for Apache Spark, providing an interface for writing Spark programs in Python and running them on a cluster.
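To give a sense of what that means in practice, here is a minimal PySpark sketch; it is not from Microsoft's announcement, and the HDFS path and column names are purely hypothetical:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session -- the entry point of the PySpark API.
spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# Hypothetical CSV on the cluster's HDFS storage; path and columns are illustrative.
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Transformations are written in Python but executed on the Spark cluster.
totals = df.groupBy("region").sum("amount")
totals.show()

spark.stop()
```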
With the new Apache Spark and Hive extension, users gain access to Python authoring and editing capabilities. Integration with Jupyter Notebooks is also offered, enabling the import and export of .ipynb files. Aside from the enhanced editing abilities, users can run selected lines of code and view the results as interactive visualizations; a sample use case would be plotting 2D graphs through matplotlib, a Python library. PySpark jobs can now also be submitted to SQL Server 2019 big data clusters, or be authored by data engineers through Azure Data Studio.
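As a rough illustration of that matplotlib use case, the sketch below aggregates a tiny, made-up dataset with PySpark, pulls the small result back to the driver as a pandas DataFrame, and plots a 2D bar chart; the data and column names are assumptions for demonstration only:

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plot-demo").getOrCreate()

# Tiny illustrative dataset; in practice this would come from HDFS or Spark SQL.
rows = [("North", 120.0), ("South", 95.5), ("West", 143.2)]
df = spark.createDataFrame(rows, ["region", "amount"])

# Aggregate on the cluster, then bring the small result back to the driver
# as a pandas DataFrame so matplotlib can plot it.
pdf = df.groupBy("region").sum("amount").toPandas()

pdf.plot(x="region", y="sum(amount)", kind="bar", legend=False)
plt.ylabel("Total amount")
plt.title("Totals by region")
plt.show()
```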
Here's a rundown of the aforementioned features and others in a more compact form:
- Link to SQL Server: The toolkit enables you to connect to and submit PySpark jobs to SQL Server 2019 Big Data Clusters.
- Python editing: Develop PySpark applications with native Python authoring support (e.g. IntelliSense, auto format, error checking, etc.).
- Jupyter Notebook integration: Import and export .ipynb files.
- PySpark interactive: Run selected lines of code, execute PySpark in notebook-like cells, and view interactive visualizations.
- PySpark batch: Submit PySpark applications to SQL Server 2019 Big Data Clusters (see the sketch after this list).
- PySpark monitoring: Integrate with the Apache Spark history server to view job history, debug, and diagnose Spark jobs.
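For the batch scenario, a submitted application is just a self-contained PySpark script. The following is a hypothetical example of such a script, not one taken from Microsoft's tooling; the file name, HDFS paths, and the assumed "timestamp" field are all illustrative:

```python
# batch_job.py -- a hypothetical, self-contained script of the kind that could
# be submitted as a PySpark batch job; all paths below are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

# Read raw JSON events from the cluster's HDFS storage (hypothetical location,
# assumed to contain a "timestamp" field).
events = spark.read.json("hdfs:///raw/events/")

# Count events per day and write the result back to HDFS as Parquet.
daily = (
    events
    .withColumn("day", F.to_date(F.col("timestamp")))
    .groupBy("day")
    .count()
)
daily.write.mode("overwrite").parquet("hdfs:///curated/daily_event_counts/")

spark.stop()
```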
If you're interested in checking out the new capabilities, you can download Visual Studio Code for Windows, Mac, and Linux here. For the latter two, you will also need to install Mono 4.2.x. Then, you can get the latest Apache Spark and Hive tools from the VS Code Marketplace, or alternatively, browse the extension repository. You can learn more about how to use these tools to create and submit PySpark scripts to clusters here.