
Is Databricks good for complex applications?

Data processing has been the hot thing in IT for at least the last seven years, and many interesting cloud services have emerged during that time. If you haven’t heard about it, Databricks is a platform that neatly packages open-source technologies like Spark, Delta Lake, and MLflow and connects to the main cloud providers (Microsoft Azure, AWS, Google Cloud).


Just a few clicks and you have a ready-to-use Spark cluster, so you can start playing with your data almost immediately. That’s pretty cool. When I logged into the workspace for the first time, I was introduced to the interactive code environment simply called Notebook, which is the easiest way to run your code. My feeling is that Databricks mainly targets data scientists, not software developers. Notebooks are outstanding for exploring data and writing smaller bits of code for different purposes, but I can’t imagine using them to develop larger, structured applications.


A few years ago, we developed a CRM (Credit Risk Management) platform for IFRS 9 calculation and stress testing. It contains many independent Java / Spring calculation modules, an orchestrator, and Pentaho Data Integration for ETL, but at the end of the day, it all makes up a tool for data transformation. If the tool were developed today, wouldn’t it make more sense to use a third-party technology like Spark or even Databricks? That is the question we asked.


It would be easier to maintain and probably quicker to develop, because we could focus only on business code. But the biggest benefit would probably be scalability. A simple ECL calculation can be done on a small cluster; when you need to run a demanding stress test, a larger environment can be prepared on demand in a few minutes.


But how do you develop and deploy something as large as our stress test application to Databricks? Of course, it is possible, but maybe not so obvious at first glance. So, how would the rather complex stress test application look when developed on the Databricks platform? Databricks allows you to upload your own libraries, and this is the way to go.



Library separation and structure

For demo purposes, I created a Python project with a setup.py file that creates an egg distribution file, but you can write your application in Java or Scala. In the case of the Credit Risk Management app, we could have:


  • one library called CRM for the whole application

  • or multiple libraries, one for each main use case (ECL, RWA, stress testing)
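For the single-library option, a minimal setup.py could look like the sketch below. The package name "crm" and its layout are illustrative, not taken from the real project:

```python
# setup.py -- minimal packaging script for the demo library
# (the "crm" package name and metadata are illustrative assumptions)
from setuptools import setup, find_packages

setup(
    name="crm",
    version="0.1.0",
    description="Demo Credit Risk Management library for Databricks",
    packages=find_packages(),  # picks up e.g. crm/ and crm/ecl/
)
```

Running `python setup.py bdist_egg` then produces the egg file under `dist/` that can be uploaded to Databricks.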


In the case of the stress test, I would create a separate library for the calculation of expected credit loss. As a quick example, I created one class with one UDF for the ECL calculation that joins a simple table of PDs (probabilities of default). Of course, a full and correct implementation would be much more complicated.
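To give an idea of the shape of such code, here is a heavily simplified sketch (not the actual demo class): a pure Python ECL helper using the textbook formula ECL = PD × LGD × EAD, with a comment showing how it could be registered as a Spark UDF after the PD join. All table and column names are assumptions.

```python
# Hypothetical ECL helper -- the real IFRS 9 calculation is far more involved
# (staging, discounting, lifetime vs. 12-month ECL, etc.).
def expected_credit_loss(pd_: float, lgd: float, ead: float) -> float:
    """Simplified expected credit loss for a single exposure: PD * LGD * EAD."""
    return pd_ * lgd * ead

# On Databricks, the pure function can be wrapped as a Spark UDF and applied
# after joining the exposures table with the PD table (illustrative names):
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#
#   ecl_udf = udf(expected_credit_loss, DoubleType())
#   result = exposures.join(pds, "rating").withColumn(
#       "ecl", ecl_udf("pd", "lgd", "ead"))
```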


CRM Databrick demo

It’s also important to know who the end-user is. Is it someone who just wants to run the calculation and get the data, or an analyst who is going to create different stress scenarios and string together smaller components? This leads to two structures:


  • Application with a limited API. The user creates an instance, passes some configuration data, and executes a “run” function. The application handles the code flow.

  • Application as just a library. The user imports functions into their notebook and uses them to calculate many intermediate results. The code flow is in their hands.
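The two structures can be contrasted with a small sketch (class and function names are made up for illustration):

```python
# 1) Application with a limited API: the user only configures and runs it.
class StressTestApp:
    def __init__(self, config):
        self.config = config

    def run(self):
        # Orchestrates the whole flow internally (load, calculate, save).
        return f"ran scenario {self.config['scenario']}"

# 2) Application as a library: the user composes the pieces in a notebook
#    and keeps the intermediate results (bodies elided -- illustrative only).
def load_portfolio(path): ...
def apply_scenario(portfolio, scenario): ...
def calculate_ecl(portfolio): ...
```

An analyst is better served by the library style; a user who only needs the final numbers is better served by the limited API.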


Application context and entry points

The application shouldn’t create a Spark session, nor should it know about table paths or names. In general, I wouldn’t hardcode anything Spark-related into the application. Instead, the application uses variables that are populated during application initialization (if the application works as a library, just pass these as method arguments).


  • Create a main.py for local development. It’s just a small script that creates a Spark session, creates an instance of the application, and passes the session with other initialization data to the application.

  • On Databricks, you can use the application from a notebook or create a separate script for a job.
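A main.py following this pattern might look like the sketch below. `CrmApplication` is a hypothetical stand-in for the real application class; the point is that the session and table paths are injected rather than hardcoded:

```python
# main.py -- local development entry point (illustrative sketch).
class CrmApplication:
    """Stand-in for the real app: it receives its Spark context, never creates it."""

    def __init__(self, spark, table_paths):
        self.spark = spark
        self.table_paths = table_paths

    def run(self):
        # Real code would read the tables through self.spark here.
        return f"using PD table at {self.table_paths['pds']}"

def main():
    # Imported lazily so the application itself has no hard Spark dependency.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("crm-dev")
             .getOrCreate())
    return CrmApplication(spark, {"pds": "/tmp/demo/pd_table"}).run()
```

In the real script, `main()` would be called under an `if __name__ == "__main__":` guard; in a Databricks notebook, the already-existing `spark` session is passed in instead.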


Deployment and usage

This is the easiest part. Upload the library to Databricks (it has to be installed on the cluster, too). If you upload a newer version, don’t forget to remove the older one, restart the cluster, and re-attach all attached notebooks.
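Besides the workspace UI, the upload can be scripted with the legacy databricks-cli; the cluster ID and file paths below are purely illustrative:

```shell
# Copy the built egg to DBFS and install it on a cluster
# (cluster ID and paths are placeholders, not real values).
pip install databricks-cli
databricks fs cp dist/crm-0.1.0-py3.9.egg dbfs:/libraries/crm-0.1.0.egg
databricks libraries install --cluster-id 1234-567890-abcde123 \
    --egg dbfs:/libraries/crm-0.1.0.egg
databricks clusters restart --cluster-id 1234-567890-abcde123
```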


CRM Notebook usage

You can now import it into a notebook, which is going to be your entry point on Databricks. If you want to run a job periodically, you can either create a small script (like main.py) or run the application from a notebook.


