Processing very large text files is not easy to accomplish, even with all the “good guys” around: Hadoop, powerful machines, concurrency frameworks (other than the Java concurrency utilities). That is because using them comes with a cost – money, time, or people with the necessary qualifications – that is not always negligible, and with limitations of its own.
For example, if you have to validate some of the content against a third-party service, using Hadoop for that is a well-known anti-pattern.
Throwing powerful machines at the problem is debatable from project to project; maybe the client simply doesn’t want to pay extra money to speed up a single piece of functionality that is not even called that often.
Concurrency frameworks can still be a huge impediment: with all the actor frameworks now in place, the trend (unfortunately) is to know less about how things actually run and just be good at using them, and the same goes for the plain Java concurrency package.
I know what you are thinking: you can just read the file line by line, process each line, then save the state, using buffers and clean Java code – let’s call this statement 1.
What if I tell you that processing the file has to be a single atomic action, with validations made on every line, on counts and other metadata from the header and trailer of the file, and even on groups of entities inside the file? And only if the file is valid (whatever that means for each business requirement) does the process have to save some events for each processed line. Given the above, statement 1 no longer serves our case, because we have to provide atomic processing, and keeping the processed lines in memory until the whole file is processed will, for large files, lead you to a nice OOME (OutOfMemoryError).
Let us present our solution:
Chronicle. Java Chronicle.
What is this product?
“Inter Process Communication (IPC) with sub millisecond latency and able to store every message.”
Img. Product official definition and usage - Chronicle-Queue (2015)
The process:
// create the chronicle (backed by memory-mapped files) using the default configuration
ChronicleConfig config = ChronicleConfig.DEFAULT.clone();
Chronicle entitiesChronicle = new IndexedChronicle("path", config);
- reading the lines from the file
- unmarshalling them (with beanIO – out of the scope of this article) to POJOs
- validating the content of each entity
- creating additional entities (business requirements) using info from the entity being processed
- writing the (BytesMarshallable) entities to the chronicle:
public void writeMarshallable(@NotNull Bytes out) {
    // a boolean flag marks whether the nullable uuid is present
    if (null == entityUuid) {
        out.writeBoolean(false);
    } else {
        out.writeBoolean(true);
        out.writeUTFΔ(entityUuid.toString());
    }
    ...
    writeListEntityMessages(messages, out);
    // write -1 (stop-bit encoded) as an end marker
    out.writeStopBit(-1);
}
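writeListEntityMessages (and the readListEntityMessages used later on the read side) are the author’s own helpers, whose bodies are not shown here. A minimal sketch of how such helpers could look with the same Bytes primitives, assuming the messages are plain strings, might be:
// sketch only: write the number of messages stop-bit encoded, then each message as UTF
private void writeListEntityMessages(List<String> messages, Bytes out) {
    int size = (messages == null) ? 0 : messages.size();
    out.writeStopBit(size);
    for (int i = 0; i < size; i++) {
        out.writeUTFΔ(messages.get(i));
    }
}
// sketch only: read the size back, then each message
private List<String> readListEntityMessages(Bytes in) {
    long size = in.readStopBit();
    List<String> result = new ArrayList<>();
    for (long i = 0; i < size; i++) {
        result.add(in.readUTFΔ());
    }
    return result;
}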
The actual append to the chronicle goes through an (ExcerptAppender):
// entitiesAppender is obtained once from the chronicle: entitiesChronicle.createAppender()
// Start an excerpt with the given chunk size
int objectSize = getObjectSize(entity); // how many bytes the serialized entity needs
entitiesAppender.startExcerpt(objectSize);
// Write the object bytes
entity.writeMarshallable(entitiesAppender);
// pad it for later
entitiesAppender.position(objectSize);
// commit the excerpt
entitiesAppender.finish();
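Putting the write side together, the per-line processing could look roughly like the sketch below (exception handling omitted). unmarshalLine, validate and getObjectSize are placeholders standing in for the beanIO unmarshalling, the business validations and the size computation; they are not Chronicle API.
ExcerptAppender entitiesAppender = entitiesChronicle.createAppender();
try (BufferedReader lines = new BufferedReader(new FileReader("path/to/input/file"))) {
    String line;
    while ((line = lines.readLine()) != null) {
        Entity entity = unmarshalLine(line);      // beanIO unmarshalling (placeholder)
        validate(entity);                         // business validations (placeholder)
        int objectSize = getObjectSize(entity);   // how many bytes are needed
        entitiesAppender.startExcerpt(objectSize);
        entity.writeMarshallable(entitiesAppender);
        entitiesAppender.position(objectSize);
        entitiesAppender.finish();                // commit the excerpt
    }
}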
After all the content of the file has been consumed, if the “checks” pass, the stored entities are read back using an (ExcerptTailer):
ExcerptTailer reader = entitiesChronicle.createTailer();
// each stored excerpt is first selected with reader.nextIndex() – see the read loop sketch below
Entity entity = new Entity();
entity.readMarshallable(reader);
public void readMarshallable(@NotNull Bytes in) throws IllegalStateException {
    StringBuilder valueToRead = new StringBuilder(100);
    // the boolean flag tells whether the nullable uuid was written
    boolean hasId = in.readBoolean();
    if (hasId) {
        entityUuid = readUuidValue(valueToRead, in);
    }
    ...
    messages = readListEntityMessages(in);
    // consume the end marker written by writeMarshallable
    in.readStopBit();
}
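The read-back loop itself can then be as simple as the sketch below; saveEvent is just a placeholder for whatever persistence of the per-line events the business requires.
ExcerptTailer reader = entitiesChronicle.createTailer();
while (reader.nextIndex()) {            // move to the next stored excerpt
    Entity entity = new Entity();
    entity.readMarshallable(reader);    // rebuild the entity from the off-heap bytes
    saveEvent(entity);                  // placeholder: the event saved for each processed line
    reader.finish();
}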
The final state is then saved to Cassandra, and at the end the chronicle is closed and cleared:
entitiesChronicle.close();
entitiesChronicle.clear();
And since a number can say more than 1000 words:
Keeping all those entities on the heap would not be possible on commodity hardware (at least not without heavy GC involvement), but with Chronicle, using off-heap memory becomes a trivial matter.
As you can see, writing and reading the entities to and from the chronicle (a memory-mapped file) takes about the same time as reading the lines, so instead of reading the file twice, you can use the chronicle and get the already unmarshalled entities back for free. It took us two hours to save our first entities; the API is nice and clean, easy to use and understand.
The latest version of the library has even more specialized functionality, such as:
- Chronicle Queue
- Chronicle Map (a short sketch follows below)
- Chronicle Logger
- Chronicle Engine
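For Chronicle Map, for example, creating an off-heap map could look roughly like the sketch below. This assumes the builder API of the newer Chronicle Map releases (ChronicleMapBuilder); the exact options vary by version, and none of this is part of the solution described above.
// sketch only: an off-heap map keyed by entity uuid, holding a counter per entity
ChronicleMap<String, Long> counts = ChronicleMapBuilder
        .of(String.class, Long.class)
        .entries(1_000_000)     // expected number of entries
        .averageKeySize(36)     // a UUID string is 36 characters
        .create();
counts.put("some-entity-uuid", 1L);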
There are definitely other things left to address and analyze, but for our needs this fit the requirements with the least effort.