Data processing has been one of the hottest topics in IT for at least the last seven years, and many interesting cloud services have emerged during that time. If you haven't heard of it, Databricks is a platform that neatly packages open-source technologies like Spark, Delta Lake, and MLflow and connects to the main cloud providers (Microsoft Azure, AWS, Google Cloud).
Just a few clicks and you have a ready-to-use Spark cluster, so you can start playing with your data almost immediately. That's pretty cool. When I logged into the workspace for the first time, I was introduced to the interactive code environment simply called a Notebook, which is the easiest way to run your code. My feeling is that Databricks mainly targets data scientists, not software developers. Notebooks are outstanding for exploring data and writing smaller bits of code for different purposes, but I can't imagine using them to develop larger, structured applications.
A few years ago, we developed a CRM (Credit Risk Management) platform for IFRS 9 calculation and stress testing. It contains many independent Java/Spring calculation modules, an orchestrator, and Pentaho Data Integration for ETL, but at the end of the day it is all a tool for data transformation. If the tool were developed today, wouldn't it make more sense to use a third-party technology like Spark, or even Databricks? That is the question we asked ourselves.
It would be easier to maintain and probably quicker to develop, because we could focus only on the business code. But the biggest benefit would probably be scalability. A simple ECL calculation can be done on a small cluster, and when you need to run a demanding stress test, a larger environment can be prepared on demand in a few minutes.
But how do you develop and deploy something as large as our stress test application to Databricks? Of course it is possible, but maybe not so obvious at first glance. So, how would a rather complex stress test application look when developed on the Databricks platform? Databricks allows you to upload your own libraries, and this is the way to go.
Library separation and structure
For demo purposes I created a Python project with a setup.py file that builds an egg distribution (a minimal sketch follows the list below), but you can write your application in Java or Scala. In the case of the Credit Risk Management app, we could have:
one library called CRM for the whole application
or multiple libraries, one for each main use case (ECL, RWA, stress testing)
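For illustration, a minimal setup.py could look like the sketch below; the package name and version are placeholders, not the real project:

```python
# setup.py -- minimal sketch; the package name and version are placeholders
from setuptools import setup, find_packages

setup(
    name="crm_stress_test",
    version="0.1.0",
    packages=find_packages(),
)
```

Running python setup.py bdist_egg then produces the egg file under dist/ that can later be uploaded to Databricks.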
In the case of a stress test, I would create a separate library for the calculation of expected credit loss. As a quick example, I created one class that has one UDF for the ECL calculation and joins a simple table of PDs. Of course, a full and correct implementation would be much more complicated.
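A simplified sketch of what such a class could look like is below; the column names (rating_grade, pd, lgd, ead) and the formula ECL = PD × LGD × EAD are my simplifying assumptions:

```python
# ecl.py -- simplified sketch; column names and the ECL formula are illustrative
from pyspark.sql import DataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType


def _ecl(pd_value, lgd, ead):
    # Simplified expected credit loss: PD * LGD * EAD
    if pd_value is None:
        return None
    return float(pd_value) * float(lgd) * float(ead)


class EclCalculation:
    """Joins exposures with a simple PD table and computes ECL via a UDF."""

    def __init__(self):
        self._ecl_udf = udf(_ecl, DoubleType())

    def calculate(self, exposures: DataFrame, pd_table: DataFrame) -> DataFrame:
        # Enrich exposures with PDs by rating grade, then compute ECL per row
        joined = exposures.join(pd_table, on="rating_grade", how="left")
        return joined.withColumn("ecl", self._ecl_udf("pd", "lgd", "ead"))
```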
It's also important to know who the end user is. Is it someone who just wants to run the calculation and get the data, or an analyst who is going to create different stress scenarios and string together smaller components? This leads to two structures:
An application with a limited API. The user creates an instance, passes some configuration data, and executes some "run" function. The application handles the code flow (sketched after this list).
An application that is just a library. The user imports functions into his notebook and uses them to calculate many intermediate results. The code flow is in his hands.
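For the first structure, the entry point could be as simple as the sketch below; the class name and configuration keys are illustrative, and the second structure would simply expose classes like EclCalculation directly instead:

```python
# Sketch of the "limited API" structure; names and config keys are illustrative
from crm_stress_test.ecl import EclCalculation  # the class sketched above (hypothetical module path)


class StressTestApplication:
    def __init__(self, spark, config):
        # The Spark session and configuration come from outside the application
        self.spark = spark
        self.config = config

    def run(self):
        # The application owns the whole flow: load -> calculate -> save
        exposures = self.spark.read.table(self.config["exposures_table"])
        pd_table = self.spark.read.table(self.config["pd_table"])
        result = EclCalculation().calculate(exposures, pd_table)
        result.write.mode("overwrite").saveAsTable(self.config["output_table"])
```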
Application context and entry points
The application shouldn't create a Spark session, nor should it have knowledge of table paths or names. In general, I wouldn't hardcode anything Spark-related into the application. Instead, the application should use variables that are populated during its initialization (if the application works as a library, just pass these as method arguments).
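In the library-style variant, the same principle applies: nothing Spark-related is hardcoded, and everything the function needs is passed in. A hedged sketch, with paths and formats chosen only for illustration:

```python
# Library-style variant: the Spark session and table locations are arguments
from pyspark.sql import DataFrame, SparkSession

from crm_stress_test.ecl import EclCalculation  # hypothetical module path


def calculate_ecl(spark: SparkSession, exposures_path: str, pd_path: str) -> DataFrame:
    exposures = spark.read.parquet(exposures_path)
    pd_table = spark.read.parquet(pd_path)
    return EclCalculation().calculate(exposures, pd_table)
```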
Create a main.py for local development. It's just a small script that creates a Spark session, creates an instance of the application, and passes the session along with other initialization data to the application.
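A minimal main.py could look like this; the master setting, module path, and config values are placeholders:

```python
# main.py -- local entry point sketch; config values are placeholders
from pyspark.sql import SparkSession

from crm_stress_test.app import StressTestApplication  # hypothetical module path

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("stress-test-local")
        .getOrCreate()
    )
    config = {
        "exposures_table": "exposures",
        "pd_table": "pd_parameters",
        "output_table": "ecl_results",
    }
    StressTestApplication(spark, config).run()
```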
You can use the application in a Databricks notebook or create a separate script for the job.
Deployment and usage
This is the easiest part. Upload the library to Databricks (it also has to be installed on the cluster). If you upload a newer version, don't forget to remove the older one, restart the cluster, and re-attach all attached notebooks.
You can now import it into a notebook, which is going to be your entry point on Databricks. If you want to run a job periodically, you can either create a small script (like main.py) or run the application from a notebook.
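In a notebook, the usage is even shorter, because Databricks already provides the spark session for you; assuming the hypothetical package from the sketches above:

```python
# Databricks notebook cell -- the library is already installed on the cluster
from crm_stress_test.app import StressTestApplication  # hypothetical module path

config = {
    "exposures_table": "exposures",
    "pd_table": "pd_parameters",
    "output_table": "ecl_results",
}
StressTestApplication(spark, config).run()  # `spark` is provided by Databricks
```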