How to use allowDiskUse in the PDI Mongo step

Aggregation took all memory and crashed

One use case of our CRM solution is running custom calculations with custom ETL jobs prepared by the application's users. This is a big advantage of the CRM, because users have better insight into the business logic than developers, who are mostly concerned with the technological side of the application. Easily definable calculation scenarios and ETL processes give more power to the users and more time to the developers, who can focus on implementing new features and fixing bugs. Unfortunately, the tools given to users can sometimes fail.


In our case, an aggregation executed in the PDI Mongo step took all available memory and crashed. This can be solved with the aggregation option allowDiskUse. When it is set to true, aggregation operations can write data to a temporary folder on disk. Spilling to disk is a standard technique when there isn't enough application memory to hold the whole data set, so one would expect the PDI Mongo step to support it, but it doesn't. Any options defined in the step are trimmed and ignored. There was no way to use them, so we had to fork the project on GitHub and implement the support ourselves.


[Screenshot: how to use allowDiskUse in MongoDB]

Because of how options are passed internally, only three of them are supported for now: allowDiskUse, bypassDocumentValidation and cursor (batchSize). It's important to note that, since we use a specific version of PDI, support was implemented only for version 7.1.0.0-12. It was tricky because the plugin uses internal wrappers with private, inaccessible fields. This changed in later versions, but in our case I had to resort to reflection.
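To illustrate what "resorting to reflection" means here, the following is a minimal sketch of reading a private field of a wrapper object. The class QueryWrapper and its field rawOptions are hypothetical stand-ins invented for this example; they are not the real PDI plugin classes, and the fork's actual code may look quite different.

```java
import java.lang.reflect.Field;

final class ReflectionSketch {

    /** Hypothetical stand-in for a plugin-internal wrapper with a private field. */
    static final class QueryWrapper {
        private final String rawOptions;

        QueryWrapper(String rawOptions) {
            this.rawOptions = rawOptions;
        }
    }

    /** Reads a field declared on the target's class by name, ignoring its private modifier. */
    static Object readPrivateField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true); // lift the access check for this Field object
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        QueryWrapper wrapper = new QueryWrapper("{ allowDiskUse: true }");
        System.out.println(readPrivateField(wrapper, "rawOptions"));
    }
}
```

The obvious drawback of this approach is that it depends on field names that can change between releases, which is one reason such a patch ends up tied to a single plugin version.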


You can find version 7.1.0.0-12 with this support on our GitHub in the branch 7.1.0.0-aggOpt. The branch aggregation-options contains an unfinished port to version 9.0.0.0. To install the step, build the project and replace the jar file in the PDI folder /system/karaf/system/pentaho/pentaho-mongodb-plugin/7.1.0.0-12. Make sure to delete the old Karaf cache. Once this is done, using allowDiskUse is easy: just define the whole aggregation followed by the options, as in the example below.

[
    {
        $match: {
            "name": "Tibor"
        }
    }
],
{
    allowDiskUse: true
}
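The example above puts two things into one text field: the pipeline array and, after a comma, an options document. As a rough sketch of how such text could be separated, the class below scans for the bracket that closes the top-level array and treats whatever follows the comma as options. The class name AggregationQuerySplitter and its split method are made up for this illustration, and this is not how the fork actually parses the input.

```java
final class AggregationQuerySplitter {

    /** Returns {pipelineText, optionsText}; optionsText is empty when no options follow. */
    static String[] split(String query) {
        String text = query.trim();
        if (!text.startsWith("[")) {
            throw new IllegalArgumentException("Expected the query to start with a pipeline array");
        }
        int depth = 0;          // nesting level of [] and {} outside string literals
        boolean inString = false;
        char quote = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == quote) {
                    inString = false;
                }
            } else if (c == '"' || c == '\'') {
                inString = true;
                quote = c;
            } else if (c == '[' || c == '{') {
                depth++;
            } else if (c == ']' || c == '}') {
                depth--;
                if (depth == 0) {
                    // The top-level pipeline array ends here.
                    String pipeline = text.substring(0, i + 1);
                    String rest = text.substring(i + 1).trim();
                    if (rest.startsWith(",")) {
                        rest = rest.substring(1).trim();
                    }
                    return new String[] {pipeline, rest};
                }
            }
        }
        throw new IllegalArgumentException("Unbalanced brackets in aggregation query");
    }

    public static void main(String[] args) {
        String query = "[ { $match: { \"name\": \"Tibor\" } } ], { allowDiskUse: true }";
        String[] parts = split(query);
        System.out.println("pipeline: " + parts[0]);
        System.out.println("options:  " + parts[1]);
    }
}
```

The sketch only counts nesting depth and does not check that bracket types match, so it is a way to picture the input format rather than a validating parser.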

Porting to version 9.0.0.0 should be easy, but we will leave that up to you if you need a newer version. I still hope the plugin will someday get this support with a cleaner implementation.
