Many of today’s Big Data technologies were built on the tacit assumption of web-based systems processing human-generated data at companies such as Facebook or Google. Such data is typically unstructured and predominantly persistent. As a result, the focus to date has been on batch processing of Big Data stored in distributed file systems. However, the data volume generated by Machine-to-Machine (M2M) interactions far surpasses the amount of data generated by humans. Example applications include smart energy grids, car-to-car communication, mobile network monitoring, and automated negotiation systems – all identified as important use cases for Big Data. As Big Data finds its way into these critical M2M-based business areas, the original design decisions need to be fundamentally rethought. M2M data streams must be processed in real time; they are predominantly transient, naturally distributed, and much more structured in nature. Current Big Data technologies are inadequate for handling massive M2M data streams: they lack the smartness and flexibility to let non-expert users set up complex analytics tasks, as well as the speed and scalability to support real-time, planetary-scale services over distributed data sources.
The goal of the FERARI project is to address these limitations and to pave the way for efficient, real-time Big Data technologies of the future. It will enable business users to express complex analytics tasks through a high-level declarative language that supports distributed Complex Event Processing (CEP) and sophisticated machine learning operators as an integral part of the system architecture. Effective real-time execution at scale will be achieved by making the sensor layer a first-class citizen in distributed streaming architectures and by leveraging in-situ data processing as a first (and, in the long run, the only realistic) choice for realizing planetary-scale Big Data systems. The system will be tested on massive real-world data sets from the telecommunication domain.
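The in-situ processing idea described above can be illustrated with a minimal sketch (not FERARI's actual implementation; node names, the spike rule, and the 5-second window are all hypothetical): each edge node evaluates a local complex-event rule over its own M2M stream and forwards only the matches, so the central site handles a fraction of the raw event volume.

```python
# Hypothetical sketch of in-situ stream processing at the edge.
# Each node applies a local complex-event rule and forwards only
# matches to the central site, reducing network traffic.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    node: str      # edge node that produced the event
    ts: float      # timestamp in seconds
    value: float   # measured reading (e.g., network load)


def detect_spikes(events: List[Event], threshold: float) -> List[Event]:
    """Local (in-situ) rule: report a reading only when it exceeds the
    threshold AND follows another high reading within 5 seconds."""
    matches: List[Event] = []
    last_high_ts = None
    for e in sorted(events, key=lambda e: e.ts):
        if e.value > threshold:
            if last_high_ts is not None and e.ts - last_high_ts <= 5.0:
                matches.append(e)
            last_high_ts = e.ts
    return matches


def central_site(edge_streams: Dict[str, List[Event]],
                 threshold: float = 100.0) -> List[Event]:
    """Central coordinator: receives only the pre-filtered matches,
    never the raw streams."""
    forwarded: List[Event] = []
    for events in edge_streams.values():
        forwarded.extend(detect_spikes(events, threshold))
    return forwarded


# Two hypothetical edge nodes with five raw events in total.
streams = {
    "cell-A": [Event("cell-A", 0.0, 50), Event("cell-A", 1.0, 120),
               Event("cell-A", 3.0, 130)],   # two high readings 2 s apart
    "cell-B": [Event("cell-B", 0.0, 150),
               Event("cell-B", 60.0, 160)],  # high readings too far apart
}
alerts = central_site(streams)
print(len(alerts))  # only 1 of 5 events crosses the network
```

The design point the sketch makes is that the filtering rule runs where the data originates; in a declarative CEP language the same pattern would be stated once and pushed down to the edge nodes by the system rather than coded by hand.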