PDI SAS Reader
If you work in FinTech, sooner or later (but probably sooner) SAS data set will get into your project. In a field dominated by one system, there is hardly a place for rejection. Unfortunately, for small companies it is very hard to integrate SAS in their stack. In our projects, we use ETL to gather data from different sources. To be more precise, we use open source tool Pentaho Data Integration.
Even though it has some flaws, it worked very well for us so far - with one little, yet important exception. The input step isn't able to read compressed SAS datasets. After discussing different options, we opted to roll out our own solution. Starting from a scratch probably wouldn't be worth it and we, fortunately, didn't have to do that. We found Parso library - lightweight open source SAS7BDAT reader.
The API was very easy to learn. On the other hand it took significantly more time to code the plugin itself as we didn't have previous experience with making PDI plugins. Still, it took us around 2 days of pure development time to have something usable and about week of testing, fixing bugs and implementing nice to have features.
We just love to use open source software in our projects, therefore it felt right to open source PDI SAS reader too. So if you have any problem or feature idea, please visit our issue tracker on Github.
But enough with history, let's get into those meaty features we implemented. The most significant quality of life improvement is the way how columns are defined. If you don't generate the SAS data set by yourself (but maybe even if you do), it might happen that the order of columns changes.
Not surprisingly the reader step is using columns' names to identify fields in stream. This is done only with first row and indices are cached, so the lookup shouldn't have impact on performance. New columns can be added this way anywhere to data set and it won't break your transformation, which is just awesome.
The latest feature (in the time of writing this article still in snapshot) is also related to columns. At the start of the project, all columns were mandatory. If any was missing, the error was thrown. This was changed later, thus optional columns were implemented. If the column is flagged, his presence inside the file isn't checked and if missing, field in stream will have null value.
Many other checks were implemented too - missing step connections, accessibility of SAS file, file emptiness check or column presence check.
Warnings, errors and additional information about file (number of columns and rows) are shown to the user when using Verify this transformation button. The usability of the step itself was also important topic for us. The output field names can be renamed and values can be converted to desired output format.
We thought it would be a good idea to force output format to BigNumber, when using feature Get Fields (to save the mouse from some clicks), so we added the option which will suggest this format instead of Number when checked. It turned out it wasn't needed at all for our use case but you might find it useful for some projects. In fact, we would like to know your opinion about the plugin in general. Give us your input and who knows, we may implement your desired feature!
Author: Luděk Novotný