In recent years, the number of data-generating devices has grown rapidly: mobile phones, sensors in cars, smart home devices, and industrial machines. As a consequence, the amount of data arising from these sources that can be stored and processed is growing exponentially. Many of today’s Big Data technologies were built on the tacit assumption of web-based systems processing data that is ultimately generated by humans. Human-generated data is predominantly persistent, i.e., it must be stored for relatively long periods of time. As a result, Big Data technologies to date focus mainly on batch processing of data stored on distributed file systems. As Big Data finds its way into other business areas, this design decision becomes limiting. One area with great future potential is machine-to-machine (M2M) interaction and the Internet of Things. This area, however, requires processing massive and predominantly transient data streams. Consequently, current Big Data technology is inadequate for processing current and expected volumes of M2M and similar data.
The FERARI vision. The goal of the FERARI project is to address these bottlenecks and to pave the way for efficient and timely processing of Big Data. We intend to exploit the structured nature of M2M data while retaining the flexibility required to handle unstructured data elements. Taking the structure of the data into account will enable business users to express complex tasks, such as efficiently identifying sequences of events over distributed sources with complex relations, or learning and monitoring sophisticated abstract models of the data. These tasks will be expressed in a high-level declarative language rather than programmed manually, as is the case in current streaming systems. The system will perform these tasks in an efficient and timely manner.
Importantly, this systematic approach will make it possible to leverage recent advances in in-situ processing algorithms, which perform much of the processing at the source where the data is generated. Instead of transporting all the data to a data center for centralized storage and processing, the data is processed in place, and a centralized location is required only to coordinate the processing and to receive the final results. The advantages of in-situ processing are especially important for M2M data, where transporting data is wasteful because the data need not be stored. In-situ processing is a crucial component for scaling to truly large, geographically distributed deployments: by not sending all the data to a centralized location for storage and processing, it addresses communication and computational scalability at the same time. By diminishing the need for large centralized infrastructures, huge data transfers, and the energy they require, in-situ processing lowers the cost and environmental impact of Big Data stream processing systems by orders of magnitude. Real-time knowledge extraction and monitoring are similarly accelerated.
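The in-situ idea described above can be illustrated with a minimal sketch. It is not the FERARI architecture or any of its algorithms; it only shows the general pattern, assuming that the global quantities of interest (here a mean and a maximum) can be recovered from small per-source summaries. The names `Summary`, `summarize_in_situ`, `coordinate`, and the sensor readings are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Summary:
    """Compact local summary: enough to recover a global mean and maximum."""
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        # Fold one raw reading into the summary; the reading itself is not kept.
        self.count += 1
        self.total += value
        self.maximum = max(self.maximum, value)

    def merge(self, other: "Summary") -> "Summary":
        # Combine two summaries without access to the underlying raw data.
        return Summary(
            self.count + other.count,
            self.total + other.total,
            max(self.maximum, other.maximum),
        )


def summarize_in_situ(readings: Iterable[float]) -> Summary:
    """Runs at the source: consume raw readings locally, keep only the summary."""
    summary = Summary()
    for value in readings:
        summary.update(value)
    return summary


def coordinate(summaries: List[Summary]) -> Summary:
    """Runs at the central site: combine one small message per source."""
    result = Summary()
    for s in summaries:
        result = result.merge(s)
    return result


# Hypothetical transient readings at three sources (assumed data).
sources = {
    "sensor_a": [1.0, 2.0, 3.0],
    "sensor_b": [10.0, 20.0],
    "sensor_c": [5.0],
}

local = [summarize_in_situ(r) for r in sources.values()]
global_summary = coordinate(local)

raw_values = sum(len(r) for r in sources.values())  # what centralizing would ship
messages_sent = len(local)                          # what in-situ processing ships
global_mean = global_summary.total / global_summary.count
```

In this toy setting the coordinator receives one message per source rather than one per reading, and the raw readings never leave the source. Real in-situ algorithms handle far richer queries and continuous monitoring; the sketch only conveys why both communication and central computation shrink.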
The FERARI project has assembled a team of world-class research and commercial partners that will, for the first time, transform these theoretical advances into a practical next-generation Big Data architecture that meets the challenges of tomorrow. The system will be demonstrated in two application scenarios from telecommunications, in which end users will test the architecture: mobile phone fraud detection and cloud health monitoring.