Modular synchronisation jconnector

A modular synchronisation tool: the jconnector

The main objective of Inquiro is to find relevant data in a large dataset. To do that, we first have to index our data in a search engine.

Indexing must run as a background job that doesn’t block the application, so we developed a dedicated tool for that purpose.

The first version of the tool was a monolithic Java application, but as our needs evolved it became increasingly difficult to maintain. We therefore reviewed the project architecture to obtain a more modular tool that can do more than basic indexing.

ARCHITECTURE

The architecture has been divided into three parts:

  • input plugins, which read data from a dedicated data source (database, file system, …) and transform it into a common format that the other plugins understand.
  • process plugins, which enrich the input data by adding metadata to it.
  • output plugins, which write the result to a destination data source (search engine, log file, same data source as the input, …).

Input plugins are run in parallel, so you can synchronise items coming from different data sources at the same time.

Process plugins are run one after the other on the same item. In our case, process plugins are the ones that take the most time to run, so for performance reasons we added the option of having several threads run the process chain, which handles several items at a time.

Like process plugins, output plugins are run one after the other on the same item, and you can configure several threads to run the output chain.
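To make the three stages more concrete, here is a minimal sketch of how such a chain could be wired together with the JDK alone. It is not the Jconnector code: the names Item, InputPlugin, ProcessPlugin, OutputPlugin and PipelineSketch are hypothetical, and for brevity the output chain runs on the same worker thread as the process chain instead of having its own thread configuration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// All names below are hypothetical; they are not the Jconnector API.

/** Common format produced by input plugins and understood by the other plugins. */
final class Item {
    final String id;
    final Map<String, String> metadata = new HashMap<>();
    Item(String id) { this.id = id; }
}

interface InputPlugin { List<Item> read(); }
interface ProcessPlugin { void process(Item item); }
interface OutputPlugin { void write(Item item); }

final class PipelineSketch {

    /** Runs every input in parallel; each item then goes through the chains in order. */
    static void run(List<InputPlugin> inputs, List<ProcessPlugin> processes,
                    List<OutputPlugin> outputs, int chainThreads) {
        // One thread per input plugin, so data sources are read at the same time.
        ExecutorService inputPool = Executors.newFixedThreadPool(Math.max(1, inputs.size()));
        // A bounded pool of workers, each running the whole chain on one item.
        ExecutorService chainPool = Executors.newFixedThreadPool(chainThreads);
        for (InputPlugin input : inputs) {
            inputPool.execute(() -> {
                for (Item item : input.read()) {
                    chainPool.execute(() -> {
                        for (ProcessPlugin p : processes) p.process(item); // sequential
                        for (OutputPlugin o : outputs) o.write(item);      // sequential
                    });
                }
            });
        }
        awaitShutdown(inputPool); // all items have been handed to the chain pool
        awaitShutdown(chainPool); // all items have been processed and written
    }

    private static void awaitShutdown(ExecutorService pool) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("Pipeline did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> written = new CopyOnWriteArrayList<>();
        InputPlugin database = () -> List.of(new Item("db-1"), new Item("db-2"));
        InputPlugin fileSystem = () -> List.of(new Item("fs-1"));
        ProcessPlugin idLength =
                item -> item.metadata.put("idLength", String.valueOf(item.id.length()));
        OutputPlugin collector = item -> written.add(item.id + " " + item.metadata);
        run(List.of(database, fileSystem), List.of(idLength), List.of(collector), 2);
        System.out.println(written); // order depends on thread scheduling
    }
}
```

Because each item is handled by a single worker from the first process plugin to the last output plugin, the plugins themselves do not need to synchronise access to the item.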

To better understand the division into three parts, here are some examples of complete plugin chains and what they are used for:

1. For more information on the oplog, see https://docs.mongodb.com/manual/core/replica-set-oplog/

These three use cases show that plugins can be reused to meet different needs. For example, the process plugin that extracts text and technical metadata from files is the same in all three use cases. It takes a generic InputStream, so it can process files coming from different sources (MongoDB, file system, …).
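The benefit of accepting an InputStream can be illustrated with a small sketch. It is only an illustration under assumptions: StreamMetadataSketch and its extract method are hypothetical names, and the "extraction" here merely records the size and decodes the bytes as UTF-8 text, which is far simpler than what a real extraction plugin would do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a process plugin; not the Jconnector implementation.
final class StreamMetadataSketch {

    /** Reads the whole stream and returns simple metadata about its content. */
    static Map<String, String> extract(InputStream in) {
        try {
            byte[] bytes = in.readAllBytes(); // requires Java 9 or later
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("sizeBytes", String.valueOf(bytes.length));
            metadata.put("text", new String(bytes, StandardCharsets.UTF_8));
            return metadata;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Source 1: bytes already in memory, for example fetched from a database.
        byte[] blob = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(extract(new ByteArrayInputStream(blob)));

        // Source 2: a file on disk, read through the same method.
        Path file = Files.createTempFile("jconnector-sketch", ".txt");
        try {
            Files.writeString(file, "hello from disk"); // requires Java 11 or later
            try (InputStream in = Files.newInputStream(file)) {
                System.out.println(extract(in));
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The extraction code never needs to know where the bytes come from; only the caller that opens the stream does.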

DESIGN HIGHLIGHTS

SPRING BOOT

We chose the Spring Boot framework as the base for our synchronisation tool. Spring Boot is particularly well suited to our use because it makes it easy to build a standalone application.

Moreover, Spring Boot lets us build a web application with an embedded server, so we can use REST entry points to monitor our application.

Main class of Jconnector:

In our case, it calls the start method of our REST controller.

The REST entry points can be defined in a class carrying the Spring annotation @RestController:

In our case, we want to be able to start and shut down our synchronisation tool. We also have a status entry point that gives some feedback about the progress of the synchronisation.
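As an illustration of what sits behind start, shutdown and status entry points, here is a framework-free sketch of the state such a controller might manage. Everything in it is an assumption: SyncControlSketch, its method names and the status text are hypothetical. In a Spring Boot application the class would carry @RestController and each method would be mapped to a URL with annotations such as @PostMapping or @GetMapping; those annotations are left out here so that the example runs with the JDK alone.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not the Jconnector controller.
final class SyncControlSketch {

    enum State { STOPPED, RUNNING }

    private final AtomicReference<State> state = new AtomicReference<>(State.STOPPED);
    private final AtomicLong itemsDone = new AtomicLong();

    /** Would back a "start" entry point. Returns false if already running. */
    boolean start() {
        return state.compareAndSet(State.STOPPED, State.RUNNING);
    }

    /** Would back a "shutdown" entry point. Returns false if already stopped. */
    boolean shutdown() {
        return state.compareAndSet(State.RUNNING, State.STOPPED);
    }

    /** Called by the plugin chain each time an item has been fully handled. */
    void itemDone() {
        itemsDone.incrementAndGet();
    }

    /** Would back a "status" entry point. */
    String status() {
        return state.get() + " itemsDone=" + itemsDone.get();
    }

    public static void main(String[] args) {
        SyncControlSketch control = new SyncControlSketch();
        control.start();
        control.itemDone();
        control.itemDone();
        System.out.println(control.status());
        control.shutdown();
        System.out.println(control.status());
    }
}
```

Using compareAndSet makes repeated start or shutdown requests harmless: only the first call changes the state, and the others simply report that nothing was done.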

CLASSLOADER

We want our synchronisation tool to be scalable and adjustable to fulfil different needs.

To do that, we use a class loader to dynamically load classes packaged inside a JAR:

The processManager, inputManager and outputManager handle the list of plugins and the execution of the process and output plugin chains.
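Dynamic loading of this kind can be sketched with the JDK's URLClassLoader. This is a minimal sketch under assumptions, not the Jconnector loader: PluginLoaderSketch and its load method are hypothetical names, and the plugin class is assumed to have a public no-argument constructor.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

// Hypothetical sketch; not the Jconnector plugin loader.
final class PluginLoaderSketch {

    /**
     * Loads className from the given JAR URLs (falling back to the application
     * class path through the parent loader) and instantiates it as a T.
     */
    static <T> T load(URL[] jarUrls, String className, Class<T> type) {
        try {
            // The loader is deliberately left open: the plugin may load more
            // classes from its JAR later, while it is running.
            URLClassLoader loader =
                    new URLClassLoader(jarUrls, PluginLoaderSketch.class.getClassLoader());
            Class<?> pluginClass = Class.forName(className, true, loader);
            Object instance = pluginClass.getDeclaredConstructor().newInstance();
            return type.cast(instance); // fails fast if the class is not a T
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        // No JAR at hand in this demo, so a JDK class is resolved through the
        // parent loader; a real call would pass the URL of a plugin JAR.
        List<?> list = load(new URL[0], "java.util.ArrayList", List.class);
        System.out.println(list.getClass().getName());
    }
}
```

With a real plugin, the URL array would point at the JAR file, for example one obtained from a java.nio.file.Path via toUri().toURL(), and type would be the plugin interface shared between the tool and its plugins.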

THREADPOOLTASKEXECUTOR

To handle multithreading for the process and output plugin chains, we use a ThreadPoolTaskExecutor, a Spring class that helps with thread execution. Rather than retrieving a thread from the pool and running the task yourself, you add your Runnable to the queue and the TaskExecutor applies its internal rules to decide when the task gets executed. The maximum number of simultaneous threads is configurable.
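The queue-and-pool behaviour described above can be tried out with the JDK's ThreadPoolExecutor, which is the class that Spring's ThreadPoolTaskExecutor wraps. The sketch below swaps in the plain JDK class so that it runs without Spring; BoundedPoolSketch and runAll are hypothetical names, and the 10 ms sleep merely stands in for a plugin chain.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch using the JDK executor instead of Spring's wrapper.
final class BoundedPoolSketch {

    /**
     * Queues `tasks` short jobs on a pool limited to `maxThreads` threads.
     * Returns {number of completed jobs, highest concurrency observed}.
     */
    static int[] runAll(int tasks, int maxThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,          // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());    // jobs wait here until a thread is free
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(10);            // stands in for a plugin chain
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
                done.incrementAndGet();
            });
        }
        pool.shutdown();                         // no new jobs; queued ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("Jobs did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { done.get(), peak.get() };
    }

    public static void main(String[] args) {
        int[] result = runAll(20, 4);
        System.out.println("completed=" + result[0] + " peakConcurrency=" + result[1]);
    }
}
```

The caller only submits Runnables; the pool decides when each one runs, and the observed concurrency never exceeds the configured maximum.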

FEEDBACK

The benefits of this new architecture have been proven: it is now very easy to add new features or to use the synchronisation tool for use cases other than indexing our data in a search engine.

For example, one new use case developed was to analyze a file system and put the result in a search engine to help an enterprise discover what kind of data it owns after an acquisition.

However, it is difficult to keep the tool fully generic: some process plugins have evolved to include specific code that depends on the input plugin the item comes from.

So, we have to refactor the code on a regular basis to keep it maintainable.
