The goal of the FERARI project is to build a framework that allows for efficient and timely processing of Big Data streams. This framework includes a distributed complex event processing (CEP) engine, a Query Planner that dynamically optimizes the CEP engine for the current data distribution, and a distributed online learning framework. These components are tied together by a modular and elastic architecture that strictly separates the components of the FERARI framework from the underlying distributed computation system. The framework can therefore be ported to any Big Data streaming platform by exchanging its runtime adaptation. After a careful evaluation of existing Big Data streaming platforms, we found Apache Storm to be the best-fitting platform and implemented the runtime adaptation accordingly. The framework has been released as open source (available at https://bitbucket.org/sbothe-iais/ferari).
The open source release of our framework allows anyone, from the scientific community to people from industry, to explore and use it. Since we provide Docker containers for the software, it can be easily installed on any machine, from a personal computer to a cluster or cloud system.
To make it easier to get started with the framework, we provide a guide that explains its installation and usage in simple steps, as well as a running example. The software allows users to set up high-performance, communication-efficient distributed stream processing systems by plugging together the provided building blocks, including algorithms based on the in-situ method (see Section 4.2.1 for a description of this method and the algorithms).
The open source release also includes the distributed online learning framework, which is implemented as a building block within the architecture. It can be used to set up large distributed machine learning systems that provide real-time services on distributed, dynamic data streams.
The novel communication-efficient distributed online learning algorithms employed are based on scientific publications at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) and the Annual Conference on Neural Information Processing Systems (NIPS). Currently, most machine learning on Big Data is performed on distributed batches of data. In contrast, the FERARI approach enables real-time learning on Big Data streams. The learning framework thus has potential for substantial impact, since it allows learning applications to scale to the data volumes and velocities generated by machine-to-machine (M2M) interaction and the Internet of Things.
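To give a rough impression of what communication-efficient distributed online learning can look like, the following Python sketch lets each node update a local linear model on its own stream and triggers a model averaging step only when some local model has drifted too far from the last synchronized one. This is an illustrative simplification under our own assumptions, not the FERARI implementation or its API: the names `LocalLearner` and `run`, the squared-loss gradient update, and the divergence threshold are hypothetical choices made for this example.

```python
import numpy as np


class LocalLearner:
    """Hypothetical node: linear model trained by online gradient descent."""

    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)    # current local model
        self.ref = np.zeros(dim)  # last synchronized (reference) model
        self.lr = lr

    def update(self, x, y):
        # One gradient step on the squared loss 0.5 * (w.x - y)^2.
        err = self.w @ x - y
        self.w -= self.lr * err * x

    def violates(self, threshold):
        # Local condition: squared distance to the reference model.
        return float(np.sum((self.w - self.ref) ** 2)) > threshold


def run(streams, targets, threshold, lr=0.05):
    """Process streams of shape (rounds, nodes, dim) with targets (rounds, nodes).

    Returns the learners and the number of synchronization rounds used.
    """
    rounds, nodes, dim = streams.shape
    learners = [LocalLearner(dim, lr) for _ in range(nodes)]
    syncs = 0
    for t in range(rounds):
        for i, learner in enumerate(learners):
            learner.update(streams[t, i], targets[t, i])
        # Communicate only if at least one node reports a violation.
        if any(learner.violates(threshold) for learner in learners):
            avg = np.mean([learner.w for learner in learners], axis=0)
            for learner in learners:
                learner.w = avg.copy()
                learner.ref = avg.copy()
            syncs += 1
    return learners, syncs
```

In this sketch, a larger threshold trades model agreement between nodes for less communication: with an infinite threshold the nodes never synchronize, while a threshold of zero makes them average after practically every round.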