
Dynamic Fields in Pentaho Data Integration

You could say that the ETL tool Pentaho Data Integration (PDI) isn't very suitable for dynamic data processing. But the same could probably be said about other ETL tools. At the end of the day, you need to tell the software how to transform the data, so it is only logical that we also have to define the data structure. For those who haven't worked with PDI, let me give an example. If I want to process a CSV file with two columns, I have to define those two columns (and their value types) in the step which loads the file. If I want to enhance the stream of rows with a new column, I have to define its name and type in the ETL.


The trouble starts when I want to create fields dynamically and I don't know their count, names or types beforehand. These can depend on the input data itself or on an external JSON configuration. The latter was our case: the request was to make one complex transformation configurable using JSON. New fields had to be added to the stream dynamically, and since this isn't a common use case, it required some investigation.
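As a purely hypothetical illustration (this is not the schema we actually used), such a JSON configuration might simply list the fields the transformation should create:

```json
{
  "fields": [
    { "name": "customer", "type": "String" },
    { "name": "orderCount", "type": "Integer" }
  ]
}
```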


Fortunately, PDI not only offers a large number of built-in steps, it is also extensible with steps containing custom Java or JavaScript logic. In order to insert new fields into the stream, we need to understand how PDI handles them. A row is an array of general objects. This row of values is accompanied by another row of the same size holding the field meta-data; this is how PDI knows how to interpret and cast the values. Of course, these two rows have to stay aligned, otherwise values will be assigned to the wrong fields. In our example, we will read one field containing a string of comma-separated names and use it to create a new field per name in a User Defined Java Class step.
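To make the value-row/meta-row pairing concrete, here is a plain-Java sketch, deliberately independent of the Kettle API (the class and method names are illustrative only). The meta row is reduced to a list of field names, and a field is resolved by finding its index in that list - which is exactly why the two rows must stay aligned:

```java
import java.util.Arrays;
import java.util.List;

public class RowDemo {

    // Resolve a field by locating its index in the "meta row",
    // then reading the same index from the value row.
    public static Object lookup(List<String> fieldNames, Object[] row, String field) {
        int idx = fieldNames.indexOf(field);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown field: " + field);
        }
        return row[idx];
    }

    public static void main(String[] args) {
        // Meta "row": field names (the real PDI meta also carries types).
        List<String> fieldNames = Arrays.asList("id", "name", "names");
        // Value row: same size and same order as the meta row.
        Object[] row = { 42L, "Alice", "Bob,Carol" };

        System.out.println(lookup(fieldNames, row, "names")); // prints Bob,Carol
    }
}
```

If the two arrays ever get out of step - say, a value is appended without its meta - the lookup silently returns the wrong value, which is the misalignment problem described above.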



I like to start working on a class by generating the main body from "Code Snippits" - "Common use". This creates a processRow method with some example code showing how to resize the input row and send it back to PDI. In general, we need to get the input row, put our values into it and adjust the meta-data. That last part is normally done in the GUI by defining field names and types. But since we don't know the exact count, and maybe not even the names or types, we cannot do that here.


[Screenshot: the Fields tab of the User Defined Java Class step in Spoon]

The Fields tab is responsible for modifying the data.outputRowMeta object. With this knowledge, we can adjust outputRowMeta manually in the code. This is actually very easy: we need to create and populate a ValueMeta object for each new field and add it to outputRowMeta.

data.outputRowMeta.addValueMeta(new ValueMeta("fieldName", ValueMetaInterface.TYPE_STRING));

We also need to populate the actual values: get the row, resize it and add the new values.

Object[] r = getRow(); // get the input row

Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size() + 1); // resize the row

int outputIndex = getInputRowMeta().size(); // index of the first new value

outputRowData[outputIndex] = "newValue"; // put a value in this place

putRow(data.outputRowMeta, outputRowData); // output the resized row



The full class, which parses names separated by commas, can look like this. Note that the meta-data must be extended only once, on the first row; this assumes the "names" field contains the same number of names on every row.

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] r = getRow();

    if (r == null) {
        setOutputDone();
        return false;
    }

    String[] names = get(Fields.In, "names").getString(r).split(",");

    if (first) {
        first = false;
        // Extend the output meta-data once, one ValueMeta per new field.
        for (int i = 0; i < names.length; i++) {
            data.outputRowMeta.addValueMeta(new ValueMeta("name " + (i + 1),
                    ValueMetaInterface.TYPE_STRING));
        }
    }

    // Resize the value row to match the extended meta-data and fill in the values.
    Object[] outputRowData = RowDataUtil.resizeArray(r, getInputRowMeta().size() + names.length);
    int outputIndex = getInputRowMeta().size();

    for (int i = 0; i < names.length; i++) {
        outputRowData[outputIndex++] = names[i];
    }

    putRow(data.outputRowMeta, outputRowData);
    return true;
}

Of course, we still need to reference the new fields by name when we save them, for example to a text file. Another issue is that these new fields aren't visible in the Preview data tab in Spoon. It's up to the user how to handle the dynamic aspect of these fields in the Java step - how to access them when the name or count can be unknown. One solution could be to try to get them with get(Fields.In, "fieldName") and catch the exception when a field doesn't exist. At the end of the day, working with dynamic fields involves writing custom code, which goes against the transparency and readability of a neatly drawn ETL process. For that reason, I would recommend using dynamic fields sparingly and only when really necessary.
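That probing idea can be sketched in plain Java, again outside the Kettle API (fieldOrNull is a hypothetical helper standing in for the get(Fields.In, ...) call plus its catch block):

```java
import java.util.Arrays;
import java.util.List;

public class DynamicFieldLookup {

    // Returns the field's value, or null when the field does not exist -
    // the equivalent of probing with get(Fields.In, ...) and catching the exception.
    public static Object fieldOrNull(List<String> fieldNames, Object[] row, String field) {
        int idx = fieldNames.indexOf(field);
        return idx < 0 ? null : row[idx];
    }

    public static void main(String[] args) {
        List<String> fieldNames = Arrays.asList("name 1", "name 2");
        Object[] row = { "Bob", "Carol" };

        // Probe "name 1", "name 2", ... until one is missing,
        // since the count is unknown upfront.
        for (int i = 1; fieldOrNull(fieldNames, row, "name " + i) != null; i++) {
            System.out.println(fieldOrNull(fieldNames, row, "name " + i));
        }
    }
}
```

The loop stops at the first missing field, so the consuming step never has to know the field count in advance.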

