Solving the Unirest JSON parsing problem on Databricks
Databricks Runtime includes a set of common libraries that you don’t have to install separately on the cluster. One of these libraries is Google Gson, which is used for mapping JSON data to Java objects. It is a dependency of many other libraries, Unirest being one of them.
We use the Unirest REST client in the MGS client, and it has worked well so far. It makes REST requests very easy to construct and execute. It has one corner case where it fails, though. Let’s demonstrate it with a simple example. We need an endpoint that returns an array of objects.
Calling asJson() on the response from such an endpoint fails on runtime 8.1 with the error java.lang.ClassCastException: Cannot cast com.google.gson.JsonArray to com.google.gson.JsonObject. The worst thing is that everything works fine when testing the code locally, because there Unirest uses the newest version of its Gson dependency. Even when my JAR contains all dependencies, Databricks will still use the older Gson version (2.2.4) that ships with the runtime. Installing Gson manually on the cluster won’t help. So what are our options?
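Before choosing an option, it can help to confirm which Gson the JVM has actually loaded. The following is a minimal sketch, not something taken from the Databricks or Unirest documentation: the ClassLocator helper and its return strings are hypothetical, and it relies only on standard Java reflection calls.

```java
import java.security.CodeSource;

// Hypothetical helper (not part of Unirest, Gson, or Databricks).
class ClassLocator {
    // Returns the location the JVM loaded the given class from.
    static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes loaded by the JDK bootstrap loader have no code source.
            if (source == null || source.getLocation() == null) {
                return "no code source (JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Prints the JAR that provides Gson, if Gson is on the classpath.
        System.out.println(locate("com.google.gson.Gson"));
    }
}
```

Running the same call locally and on the cluster should show whether Gson comes from your own JAR or from a JAR provided by the runtime.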
Cluster init script
We can remove Gson 2.2.4 and install the newest version using a cluster init script. This approach is described in the official Databricks documentation, but it has two disadvantages. First, because the API or behavior could change between versions, we would have to test the change thoroughly: there is a risk of breaking some other dependency.
The second problem is related to how we use Unirest. It’s not used directly in notebooks. Since it’s a part of the MGS client library, we can’t just tell users to tamper with the Databricks runtime using a cluster init script. Most users won’t do it because it’s too risky and too technical.
Of course, the best solution would be a new Databricks runtime with an upgraded Gson version. I don’t think that using Unirest on a REST resource whose response starts with an array is particularly uncommon.
Mapping to a custom class
For some reason, mapping JSON to a custom class works even when the top level is still an array. In notebooks, this is not a very nice option, and I would rather return the response as a String and do the JSON parsing separately. That adds barely one line of code. But since we develop a library, I can define as many custom classes for different responses as I wish. It’s even easier when the REST target has a package with separate DTO classes.
In the case of our first example above, I would define a class User.
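A class like this could look as follows. This is only a sketch: the payload of the example endpoint is not shown here, so the fields id and name are assumptions and need to match the keys of the real JSON objects.

```java
// Hypothetical DTO: the fields id and name are assumptions made for
// illustration. Rename them to match the keys in the actual response.
class User {
    private long id;
    private String name;

    // No-argument constructor for reflection-based JSON mappers.
    User() {}

    User(long id, String name) {
        this.id = id;
        this.name = name;
    }

    long getId() { return id; }

    String getName() { return name; }

    @Override
    public String toString() {
        return "User{id=" + id + ", name='" + name + "'}";
    }
}
```

In a library, you would typically declare the class public and keep it in its own file next to the other DTOs.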
And the REST call would map the response to an array of users.
Not only does this work without a ClassCastException, it’s also a cleaner solution. I’m not very happy that the solution to the problem comes down to “just don’t use it” (asJson()). But I hope I provided some options you may find useful. If you have another solution that is more closely related to Gson and the Databricks runtime, feel free to share it on the Databricks forum.
Author: Luděk Novotný