It's fair to say that the ETL tool PDI isn't very well suited for dynamic data processing. You could probably say something similar about other ETL tools: at the end of the day, you need to tell the software how to transform the data, so it is only logical that we also have to define the data structure. For those who haven't worked with PDI, let me give you an example. If I want to process a CSV file with two columns, I have to define those two columns (and their value types) in the step which loads the file. If I want to enhance the stream of rows with a new column, I have to define its name and type in the ETL tool as well.
The trouble starts when I want to create fields dynamically and don't know their count, names or types beforehand. These can depend on a different set of input data or on an external JSON configuration. The latter was our case: the request was to make one complex transformation configurable using JSON. New fields had to be added to the stream dynamically, and since this isn't a common use case, it required some investigation.
Fortunately, PDI not only offers a high number of different functions, it is also extensible with steps containing custom Java or JavaScript logic. In order to insert new fields into the stream, we need to understand how PDI handles them. A row is an array of plain Java objects. This row of values is accompanied by a row of meta-data of the same size, which is how PDI knows how to interpret and cast the values. Of course, the two rows have to stay aligned, otherwise values end up assigned to the wrong fields. In our example, we will read one field containing a string of names and use that string to create a new field per name in a User Defined Java Class step.
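As a small, hedged illustration (the field names and values here are made up), this is roughly what such a pair looks like when built with the same classes used later in this post:

// A hypothetical row with two fields: the meta-data row describes the positions,
// the value row holds the objects in the same order.
RowMetaInterface meta = new RowMeta();
meta.addValueMeta(new ValueMeta("id", ValueMetaInterface.TYPE_INTEGER));   // position 0
meta.addValueMeta(new ValueMeta("names", ValueMetaInterface.TYPE_STRING)); // position 1

// Values are matched to fields purely by position, so the two arrays must stay aligned.
// PDI represents integer values as java.lang.Long.
Object[] row = new Object[] { 1L, "Alice,Bob" };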
I like to start working on such a class by generating the main body from Code Snippits - Common use. It creates the processRow method and some example code showing how to resize an input row and send it back to PDI. In general, we need to get the input row, put our values into it and adjust the meta-data. Normally that last step is done from the GUI by defining field names and types, but since we don't know the exact count, or maybe even the names or types, we cannot do that here.
The Fields tab is what modifies the data.outputRowMeta object, and with that knowledge we can adjust outputRowMeta manually in the code. This is actually very easy: we create and populate a ValueMeta object for each new field and add it to outputRowMeta.
data.outputRowMeta.addValueMeta(new ValueMeta("fieldName", ValueMetaInterface.TYPE_STRING));
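As a side note, newer PDI versions mark this ValueMeta constructor as deprecated; assuming PDI 5.x or later, the concrete meta classes should work the same way:

// equivalent, non-deprecated variant in newer PDI versions
data.outputRowMeta.addValueMeta(new ValueMetaString("fieldName"));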
We also need to populate the actual values by getting the row, resizing it and adding the new values. The snippet below shows the individual calls; the full class, which parses names separated by a comma, follows after it.
Object[] r = getRow(); // get the input row
Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size() + 1); // resize the row
int outputIndex = getInputRowMeta().size(); // index of the first new value
outputRowData[outputIndex] = "newValue"; // put a value in that place
putRow(data.outputRowMeta, outputRowData); // output the resized row
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] r = getRow(); // get the next input row
    if (r == null) {
        setOutputDone(); // no more rows, signal the end of the step
        return false;
    }
    if (first) {
        first = false;
    }

    // read the comma-separated names from the input field "names"
    String[] names = get(Fields.In, "names").getString(r).split(",");

    // make room for one new value per name
    Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size() + names.length);
    int outputIndex = getInputRowMeta().size(); // index of the first new value

    for (int i = 0; i < names.length; i++) {
        outputRowData[outputIndex++] = names[i];
        // register the matching meta-data; note that this runs for every incoming row,
        // so with more than one row the output meta-data keeps growing
        data.outputRowMeta.addValueMeta(new ValueMeta("name " + (i + 1),
                ValueMetaInterface.TYPE_STRING));
    }

    putRow(data.outputRowMeta, outputRowData); // send the resized row downstream
    return true;
}
Of course, we still need to reference the new fields by name when we save them, for example to a text file. Another issue is that these new fields aren't visible in the Preview data tab in Spoon. It's up to the user how to handle the dynamic aspect of these fields in a Java step - how to access them when the name or count may be unknown. One solution could be trying to get them with the method get(Fields.In, "fieldName") and catching the exception. At the end of the day, working with dynamic fields involves writing custom code, which goes against the transparency and readability of a neatly drawn ETL process. For that reason, I would recommend using dynamic fields sparingly and only when it's really necessary.
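As a hedged sketch of that last idea (the field name is just an example, and this assumes the downstream step is also a User Defined Java Class), the incoming meta-data can be probed by name, which avoids relying on an exception:

// indexOfValue returns -1 when the field doesn't exist in the incoming row meta-data
int idx = getInputRowMeta().indexOfValue("name 1");
if (idx >= 0) {
    String value = getInputRowMeta().getString(r, idx); // read the dynamic field as a string
    // ... use the value
}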