How to use allowDiskUse in the PDI Mongo step
Aggregation took all memory and crashed
One use case of our CRM solution is running custom calculations with custom ETL jobs prepared by the application's users. This is a big advantage of the CRM, because users have better insight into the business logic than developers, who are mostly concerned with the technological side of the application. Easily definable calculation scenarios and ETL processes give more power to the users and more time to the developers, who can focus on implementing new features and fixing bugs. Unfortunately, the tools given to users can sometimes fail.
In our case, an aggregation executed in the PDI Mongo step took all available memory and crashed. This can be solved with the aggregation option allowDiskUse. When it is set to true, aggregation operations can write data to a temporary folder. Spilling to disk is a standard technique when there isn't enough application memory to fit the whole data set, and one would expect the PDI Mongo step to support it, but it does not. Any options defined in the step are trimmed and ignored. There was no way to use them, so we had to fork the project on GitHub and implement the support ourselves.
Because of how options are passed internally, only three of them are supported for now: allowDiskUse, bypassDocumentValidation and the cursor's batchSize. It's important to note that, since we use a specific version of PDI, support was implemented only for version 18.104.22.168-12. It was tricky because the plugin uses internal wrappers with private, inaccessible fields. This was changed in later versions, but in our case I had to resort to reflection.
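To illustrate what resorting to reflection means here, below is a minimal sketch of reading a private field from a wrapper object. It is not the code from our fork: the class CollectionWrapper, its field delegate and the helper readPrivateField are hypothetical names made up for this example, and the real plugin wrappers look different.

```java
import java.lang.reflect.Field;

public class PrivateFieldAccess {

    // Hypothetical stand-in for a plugin wrapper that hides the object it wraps.
    public static class CollectionWrapper {
        private final String delegate;

        public CollectionWrapper(String delegate) {
            this.delegate = delegate;
        }
    }

    // Looks up a field declared directly on the target's class and reads it,
    // even though the field is private.
    public static Object readPrivateField(Object target, String fieldName)
            throws ReflectiveOperationException {
        Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true); // lift the 'private' access check for this Field object
        return field.get(target);
    }

    public static void main(String[] args) throws Exception {
        CollectionWrapper wrapper = new CollectionWrapper("raw-collection-handle");
        Object inner = readPrivateField(wrapper, "delegate");
        System.out.println(inner);
    }
}
```

The downside of this approach is that it ties the code to the exact field names of one plugin version, which is one reason the support targets a single version only.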
You can find version 22.214.171.124-12 with this support on our GitHub in the branch 126.96.36.199-aggOpt. The branch aggregation-options contains an unfinished port to version 188.8.131.52. To install the step, build the project and replace the jar file in the PDI folder /system/karaf/system/pentaho/pentaho-mongodb-plugin/184.108.40.206-12. Make sure to delete the old Karaf cache. Once this is done, using allowDiskUse is easy: just define the whole aggregation together with its options.
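The exact query syntax accepted by the forked step is not reproduced here. As a reference point, the three supported options correspond to fields of MongoDB's aggregate database command, and the sketch below builds such a command with plain Java collections and prints it as JSON-like text. The collection name orders, the $sort pipeline and the batch size are assumptions chosen for illustration, and the small toJson helper is a hypothetical convenience that does no string escaping.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AggregateCommandSketch {

    // Assembles the fields of an 'aggregate' command in a fixed order.
    public static Map<String, Object> buildCommand(String collection, List<?> pipeline, int batchSize) {
        Map<String, Object> cursor = new LinkedHashMap<>();
        cursor.put("batchSize", batchSize);

        Map<String, Object> command = new LinkedHashMap<>();
        command.put("aggregate", collection);
        command.put("pipeline", pipeline);
        command.put("allowDiskUse", true);              // let memory-heavy stages spill to disk
        command.put("bypassDocumentValidation", false); // only matters for pipelines that write, e.g. with $out
        command.put("cursor", cursor);
        return command;
    }

    // Naive renderer for maps, lists, strings and numbers/booleans; illustration only.
    public static String toJson(Object value) {
        if (value instanceof Map) {
            return ((Map<?, ?>) value).entrySet().stream()
                    .map(e -> "\"" + e.getKey() + "\": " + toJson(e.getValue()))
                    .collect(Collectors.joining(", ", "{", "}"));
        }
        if (value instanceof List) {
            return ((List<?>) value).stream()
                    .map(AggregateCommandSketch::toJson)
                    .collect(Collectors.joining(", ", "[", "]"));
        }
        if (value instanceof String) {
            return "\"" + value + "\"";
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        // Hypothetical pipeline: sorting a large collection is a typical memory-hungry stage.
        List<?> pipeline = List.of(Map.of("$sort", Map.of("amount", -1)));
        System.out.println(toJson(buildCommand("orders", pipeline, 1000)));
    }
}
```

Whatever form the step's query takes, these are the option names to look for: allowDiskUse and bypassDocumentValidation at the top level, and batchSize nested under cursor.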
Porting to version 220.127.116.11 should be easy, but we will leave that up to you if you need a newer version. I still hope the plugin will get this support some day with a cleaner implementation.
Author: Luděk Novotný