Exploitable Assets

Asset: Open Software Repository and Distributed Online Learning Framework

The goal of the FERARI project is to build a framework that allows for efficient and timely processing of Big Data streams. This framework includes a distributed complex event processing (CEP) engine, a Query Planner that dynamically optimizes the CEP engine for the current data distribution, and a distributed online learning framework. These components are tied together by a powerful, modular and elastic architecture, which strictly separates the components of the FERARI framework from the underlying distributed computation system. The framework can be adapted to any Big Data streaming platform by exchanging its runtime adaptation; after a careful evaluation of existing Big Data streaming platforms, we found Apache Storm to be the best-fitting platform and implemented the runtime adaptation accordingly. The framework has been released as open source (available at https://bitbucket.org/sbothe-iais/ferari).

The open source release of our framework allows anyone, from the scientific community to industry, to explore and use it. Since we provide Docker containers for the software, it can easily be installed on any machine, from a personal computer to a cluster or cloud system.

To further facilitate getting involved with the framework, we provide a guide that explains its installation and usage in simple steps, as well as a running example. The software allows users to set up high-performance, communication-efficient distributed stream processing systems by plugging together the provided building blocks – including algorithms based on the in-situ method (see 4.2.1 for a description of this method and the algorithms).

The open source release also includes the distributed online learning framework, which is implemented as a building block within the architecture. It can be used to set up large distributed machine learning systems that provide real-time services on distributed, dynamic data streams.

The novel communication-efficient distributed online learning algorithms employed are based on scientific publications at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) and the Annual Conference on Neural Information Processing Systems (NIPS). Currently, most machine learning on Big Data is performed on distributed batches of data; in contrast, the FERARI approach enables real-time learning on streaming Big Data. The learning framework therefore has the potential for substantial impact, since it allows scaling learning applications to the data volumes and velocities generated by machine-to-machine (M2M) interaction and the Internet of Things.
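
To illustrate the general flavour of such communication-efficient protocols (this is a simplified sketch, not the exact algorithms from the cited publications), local learners can update their models on their own streams and trigger a synchronization only when a model drifts too far from the last jointly agreed reference model:

import numpy as np

class LocalLearner:
    """Online learner at one site: it synchronizes with the coordinator only
    when its model drifts too far from the last shared reference model
    (simplified sketch of a divergence-triggered protocol)."""

    def __init__(self, dim, lr=0.1, delta=1.0):
        self.w = np.zeros(dim)     # local linear model
        self.ref = np.zeros(dim)   # last synchronized reference model
        self.lr = lr               # learning rate
        self.delta = delta         # allowed drift before communicating

    def update(self, x, y):
        """One online gradient step on a (features, label) pair; returns True
        if the local model has left the allowed ball around the reference."""
        pred = self.w @ x
        self.w -= self.lr * (pred - y) * x
        return np.linalg.norm(self.w - self.ref) > self.delta


def synchronize(learners):
    """Coordinator step: average the local models and redistribute the result
    as the new common reference model."""
    avg = np.mean([l.w for l in learners], axis=0)
    for l in learners:
        l.w = avg.copy()
        l.ref = avg.copy()
    return avg

In the published protocols the coordinator can first try to resolve a violation with only a subset of the sites; the sketch above always performs a full synchronization for simplicity.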

Asset: In-situ Monitoring Technologies

Modern information systems are inherently distributed. In many cases, such as telecommunication systems, content management systems, and financial systems, data is processed in physically distributed locations. Even when data is processed at a central location, modern data centres consist of thousands (or even tens of thousands) of computation and storage units, which in practice process data in a distributed manner.

Consequently, monitoring the data processed by these systems to detect global events of interest in real time is becoming increasingly challenging. This challenge is clearly evident in the FERARI use cases. The first use case deals with telecom fraud: fraudsters exploit the distributed nature of telecom systems to devise ever more sophisticated methods of fraud. The second use case is system health monitoring, which consists of the real-time detection of abnormal performance of system components so that they can be serviced before they fail, thus enhancing the availability and robustness of the system while reducing costs.

The central goal of the FERARI project is to perform most of the processing of distributed data locally, and thus detect complex patterns of interest (e.g. fraudulent activity or abnormal performance) without collecting the data at a centralized location. This technology is referred to as in-situ processing. The FERARI project has introduced several innovations in the application of in-situ methods to real-world applications.

The innovations introduced by FERARI to in-situ processing are within the geometric monitoring framework [1], which enables defining complex events by applying an arbitrary function to data from distributed sources. Geometric monitoring solutions break such monitoring tasks up into individual constraints on the data generated at the local sites. The idea is that as long as none of the local constraints is violated, it is guaranteed that the global event of interest has not occurred.
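
In slightly more formal terms (a simplified formulation that assumes equally weighted sites), the setting can be summarized as follows: k sites hold local statistics vectors, the global statistic is their average, and an alert must be raised as soon as a (possibly non-convex) function f of it crosses a threshold T:

\[
  v \;=\; \frac{1}{k}\sum_{i=1}^{k} v_i, \qquad \text{alert iff } f(v) > T .
\]

Geometric monitoring assigns each site i a local constraint (a safe zone S_i) such that

\[
  v_1 \in S_1,\; \dots,\; v_k \in S_k \;\Longrightarrow\; f(v) \le T ,
\]

so no communication is needed as long as every site remains inside its own safe zone.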

Distributing the detection among the local sites reduces the time delays and bottlenecks associated with collecting data at a central location, and thus enables detecting such events in real time. A major drawback of geometric monitoring, however, is that applying the types of constraints proposed thus far may be very demanding in terms of computational resources at the local sites.

The first innovation introduced by FERARI is a novel type of local constraint for geometric monitoring, known as convex bounds. While existing methods for determining local constraints rely on solutions to complex optimization problems, convex bounds employ functional analysis methods to select a convex function that bounds the function used to define the event of interest. While specifying such a bound for a given function may require some mathematical expertise, once it has been specified, applying the bound to data at run time is very efficient. In addition, although this was not the original motivation, empirical evidence has shown that these constraints are also very effective in the sense that they produce a very small number of false alarms (i.e. instances where local constraints were violated even though the global event has not occurred) in comparison to existing methods.
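
The reasoning behind such a bound can be made concrete with a toy example (purely illustrative, and not one of the specific bounds derived in FERARI): the product f(x, y) = x*y is not convex, but it is bounded from above by the convex function g(x, y) = (x^2 + y^2)/2. If every site keeps its local vector inside the sublevel set {v : g(v) <= T}, then, because that set is convex and the global vector is an average of the local vectors, g (and hence f) of the global vector also stays below T. A minimal sketch in Python:

import numpy as np

# Illustrative (hypothetical) monitored function: f(v) = v[0] * v[1].
# f is not convex.
def f(v):
    return v[0] * v[1]

# Convex upper bound: x*y <= (x^2 + y^2) / 2 for all real x, y,
# so g is convex and g >= f everywhere.
def g(v):
    return (v[0] ** 2 + v[1] ** 2) / 2.0

THRESHOLD = 10.0

def local_constraint_ok(local_vector):
    """Cheap check each site runs on its own statistics vector.
    If g(v_i) <= T at every site, convexity of g's sublevel set
    guarantees g(mean of v_i) <= T, hence f(mean of v_i) <= T."""
    return g(local_vector) <= THRESHOLD

def coordinator_check(local_vectors):
    """Run only when some site reports a local violation: the exact
    global condition is evaluated on the true global (average) vector."""
    global_vector = np.mean(local_vectors, axis=0)
    return f(global_vector) > THRESHOLD

A false alarm (a local violation without a global threshold crossing) then only costs one round of communication to evaluate the exact condition at the coordinator, while quiet periods cost nothing.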

A second innovation introduced by FERARI is the adaptation of geometric monitoring methods to dynamic, large-scale industrial settings. So far, geometric monitoring has been applied in laboratory settings as part of academic research. Using geometric monitoring in an industrial production environment requires extending the method with several non-trivial adaptations.

Laboratory setups assume that the number of sites participating in the monitoring task is fixed and known in advance, and typically the algorithms are run on at most several hundred sites. Applying the method to a distributed counting task over the CDR records produced by the cell towers of a telecom provider, as described in Deliverable D3.2, involves running the monitoring task on over 18,000 cell towers. Additionally, the set of active subscribers is not known in advance, and every subscriber interacts with only a small subset of these cell towers.

The efficient application of geometric monitoring in this case required extending the monitoring framework to simultaneously run concurrent monitoring tasks for an unknown and dynamic set of subscribers. Furthermore, the local constraints were modified to support a different and dynamic set of sites for each subscriber.
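
A much simplified sketch of such per-subscriber counting (illustrative only; the algorithms and constraints actually used are described in Deliverable D3.2, and names such as local_budget below are assumptions) could look as follows: each tower keeps per-subscriber counters and contacts the coordinator only for subscribers whose locally allocated share of the global threshold is exhausted, and the coordinator then resolves the violation using only the towers that subscriber has actually visited.

from collections import defaultdict

class TowerMonitor:
    """One cell tower: keeps per-subscriber event counts and reports a local
    violation once its share of the global budget for that subscriber is
    used up (uniform shares are assumed here for simplicity)."""

    def __init__(self, tower_id, local_budget=100):
        self.tower_id = tower_id
        self.local_budget = local_budget     # per-subscriber local share
        self.counts = defaultdict(int)       # subscriber -> local count

    def observe(self, subscriber_id):
        """Process one CDR event; returns the subscriber id if the local
        share is exceeded and the coordinator has to be contacted."""
        self.counts[subscriber_id] += 1
        if self.counts[subscriber_id] > self.local_budget:
            return subscriber_id
        return None


def resolve(subscriber_id, towers, global_threshold):
    """Coordinator step on a local violation: collect the counts for that
    single subscriber from the towers that have actually seen it (a dynamic
    set that differs per subscriber) and test the exact global condition."""
    total = sum(t.counts[subscriber_id] for t in towers
                if subscriber_id in t.counts)
    return total > global_threshold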

[1] Sharfman, Izchak, Assaf Schuster, and Daniel Keren. “A geometric approach to monitoring threshold functions over distributed data streams.” ACM Transactions on Database Systems (TODS) 32.4 (2007): 23.

 

Asset: ProtonOnStorm

According to Gartner [1][2][3][4] and [5], two forms of stream processing software have emerged in the past 15 years. The first were CEP systems: general-purpose development and runtime tools used by developers to build custom event-processing applications without having to re-implement the core algorithms for handling event streams, since they provide the necessary building blocks for event-driven applications. Modern commercial CEP platform products even include adapters to integrate with event sources, development and testing tools, dashboard and alerting tools, and administration tools. More recently, the second form – distributed stream computing platforms (DSCPs) such as Amazon Web Services Kinesis [1] and open source offerings including Apache Samza [2], Spark [3] and Storm [4] – was developed. DSCPs are general-purpose platforms without full native CEP analytic functions and the associated accessories, but they are highly scalable and extensible and usually offer an open programming model, so developers can add the logic to address many kinds of stream processing applications, including some CEP solutions. They are therefore not considered “real” complex event processing platforms. In particular, the Apache open source projects (Storm, Spark, and recently Samza) have gained a fair amount of attention and interest ([5], [6]), and these may well mature into commercial offerings in the future and/or get embedded in existing commercial product sets. DSCPs are designed to cope with Big Data requirements, making them an essential component of any organization's infrastructure.

Today, there are already some implementations that take advantage of the pattern recognition capability of CEP systems along with the scalability that DSCPs offer, combining them in a holistic architecture. ProtonOnStorm is one example.

Project partner IBM has implemented its open source complex event processing research tool IBM PROactive Technology ONline (PROTON) on top of Storm in the scope of the FP7 EU FERARI project, thus making it a distributed and scalable CEP engine. ProtonOnStorm has been released as open source under the Apache 2.0 license. The source code, along with manuals, can be accessed at [5].

ProtonOnStorm is currently applied in two other EU projects for different purposes: in SPEEDD (FP7) it is extended to cope with certain aspects of uncertainty, while in Psymbiosys (H2020) it is applied to different manufacturing-intelligence scenarios that use IoT devices. Both projects deal with Big Data.

The ProtonOnStorm programming model serves as the basis for the optimization plans developed in the scope of the project. The events derived by the system, in the form of fraud alerts, feed the project's dashboard.

ProtonOnStorm is the underlying CEP engine in the FERARI demo that has been accepted for SIGMOD 2016 and will also be shown as an invited demo at DEBS 2016.

[1] http://aws.amazon.com/kinesis/

[2] http://samza.incubator.apache.org/

[3] http://spark.apache.org/streaming/

[4] https://storm.apache.org/

[5] https://github.com/ishkin/Proton/

 

Asset: The Event Model (TEM)

The Event Model (TEM) provides a new way to model, develop, validate, maintain, and implement event-driven applications. In TEM, the event derivation logic is expressed in a high-level declarative language as a collection of normalized tables (in an Excel-spreadsheet-like fashion). These tables can be validated and then used for code generation. TEM is based on a set of well-defined principles and building blocks and does not require substantial programming skills; it therefore also targets non-technical users.

This idea has already been successfully proven in the domain of business rules by The Decision Model (TDM) [8][6]. TDM groups the rules into natural logical groups to create a structure that makes the model relatively simple to understand, communicate, and manage.

The Event Model follows the Model Driven Engineering approach and can be classified as a CIM (Computation Independent Model), providing independence from the physical data representation and omitting details that are obvious to the designer. This model can be directly translated into an execution model (PSM – Platform Specific Model in Model Driven Architecture terminology) through an intermediate generic representation (PIM – Platform Independent Model). In the course of the first two years of FERARI, we developed the CIM model for the functional and non-functional requirements of event-driven applications. In the third year we plan to articulate the translation of TEM (the CIM) into an event processing network (PIM). The event processing network can then be converted into a JSON file and eventually into a running application (PSM).
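
As a purely hypothetical illustration of what such an intermediate event processing network description might contain (the element names below are assumptions and do not follow the exact ProtonOnStorm metadata schema), a PIM-level network could be assembled and serialized to JSON along the following lines:

import json

# Hypothetical, simplified event processing network (PIM-level) description.
epn = {
    "event_types": [
        {"name": "CallRecord", "attributes": ["caller", "callee", "duration"]},
        {"name": "SuspectedFraud", "attributes": ["caller", "reason"]},
    ],
    "agents": [
        {
            "name": "HighVolumeCaller",
            "input": "CallRecord",
            "context": "SlidingWindow1h",   # temporal context
            "partition_by": "caller",       # segmentation context
            "condition": "count > 100",
            "derives": "SuspectedFraud",
        }
    ],
    "consumers": [
        {"name": "FraudDashboard", "subscribes_to": ["SuspectedFraud"]},
    ],
}

# The PIM-level description is then serialized (e.g. to JSON) and handed
# to the runtime, which turns it into a running application (PSM level).
print(json.dumps(epn, indent=2))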

During the first two years of the FERARI project we extended the basic model we had and completed all the functional aspects as well as the non-functional requirements. The work until now has been theoretical and is reported in the FERARI WP4 deliverables [1]. The model parts covering the functional requirements are described in the paper “A Model Driven Approach for Event Processing Applications”, accepted for DEBS 2016.

[1] Available at: http://www.ferari-project.eu/key-deliverables/

 

Asset: CEP optimizer

The FERARI CEP optimizer is being developed within WP5; its design principles and current implementation details were reported in Deliverable D5.2. In FERARI, we aim to design solutions capable of scaling at a planetary level, thus retaining the potential of being applicable to arbitrarily large businesses and data volumes. In vast-scale applications, (event) data of interest are produced or collected at remote sources and need to be combined to respond to global application inquiries. In this context, first collecting voluminous data at a central computing point (termed a site/cloud) and then processing them is infeasible, because such a solution would overburden the available network resources of the underlying infrastructure and would cause a bottleneck at that central point. Data of interest arriving at multiple, potentially geographically dispersed, sites should first be efficiently processed in-situ (where possible) and then wisely combined at an inter-cloud level to provide holistic results to the final stakeholders. Efficiency in such a setting calls for reduced communication, to avoid congested network links among sites, without affecting the timely delivery of notifications and application reports. These goals are at the core of our CEP optimizer, which produces execution plans for application queries that orchestrate site interactions so as to ensure both:

  a) optimal network utilization, and
  b) compliance with application Quality-of-Service (QoS) requirements concerning the time horizon, measured from the occurrence of an interesting situation, within which the corresponding reports should be made available to end users (a simplified illustration of this trade-off is sketched below).
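
As a highly simplified illustration of this trade-off (not the actual FERARI optimizer or its cost model, which are described in Deliverable D5.2), plan selection can be thought of as discarding candidate execution plans that violate the latency requirement and picking the cheapest remaining one in terms of network traffic:

# Illustrative plan selection: each candidate plan carries an estimated
# network cost (e.g. bytes shipped between sites) and an estimated
# end-to-end detection latency. Plan names and numbers are made up.
def choose_plan(candidate_plans, latency_bound):
    """candidate_plans: list of dicts with 'network_cost' and 'latency'.
    Returns the plan with minimal network cost among those that meet
    the QoS latency requirement, or None if no plan qualifies."""
    feasible = [p for p in candidate_plans if p["latency"] <= latency_bound]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p["network_cost"])

plans = [
    {"name": "ship-all-to-central-site", "network_cost": 950, "latency": 2.0},
    {"name": "push-down-filters",        "network_cost": 120, "latency": 3.5},
    {"name": "partial-aggregation",      "network_cost": 200, "latency": 2.8},
]
best = choose_plan(plans, latency_bound=3.0)   # -> "partial-aggregation"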

Our algorithms and design principles are generic enough to support a wide range of application query functionality. They can be employed on top of any CEP engine selected as the software responsible for intra-cloud data processing and query execution at the site level. Therefore, our approach can serve as a paradigm for any similar implementation, irrespective of the CEP engine or specific application demands.

Having clarified that, in the context of FERARI the currently developed software is built to support – in terms of application query operators and respective functionality – the IBM Proactive Technology Online (Proton) CEP engine and, in particular, its streaming cloud extension, ProtonOnStorm. This is not too great a restriction, as Proton and ProtonOnStorm are open source platforms with an already significant user base, as will be explained below. Moreover, the TEM specification, another important FERARI asset that lets business users express inquiries in a declarative way (see details above), can be mapped to an Event Processing Network (EPN) conceptualization, which is supported by ProtonOnStorm at the technical level.

The TEM specification is therefore, by extension, also supported by our approach. In addition, the CEP optimizer outcomes incorporate and exploit the FERARI assets related to in-situ processing at the site level.

The CEP optimizer, as a FERARI asset, has already been promoted to the scientific community by publishing corresponding results – in close collaboration with the rest of the FERARI consortium – in the top database conference and the most esteemed conference of the event processing community:

FERARI: A Prototype for Complex Event Processing over Streaming Multi-cloud Platforms,

I. Flouris, V. Manikaki, N. Giatrakos, A. Deligiannakis, M. Garofalakis, M. Mock, S. Bothe, I. Skarbovsky, F. Fournier, T. Krizan, M. Stajcer, J. Yom-Tov, T. Curin

in: SIGMOD, 2016 (accepted – to appear).

Complex Event Processing over Streaming Multi-cloud Platforms – The FERARI Approach,

I. Flouris, V. Manikaki, N. Giatrakos, A. Deligiannakis, M. Garofalakis, M. Mock, S. Bothe, I. Skarbovsky, F. Fournier, T. Krizan, M. Stajcer, J. Yom-Tov, M. Volarevic

in: DEBS, 2016 (invited submission – under review).

 

Asset: Data Masking Algorithms

The open source data masking framework has the potential to be exploited in industry as well as in the scientific domain. Within the data masking framework, three main concepts can be identified as exploitable.

The first is test data management, which can be used to produce test data. With data masking, organizations can make sure employees do not mishandle corporate data, for example by making private data sets public or by moving production data to insecure test environments. In reality, masking data for testing and sharing is almost a trivial subset of the full customer requirement.

The real goal is administration of the entire data security lifecycle – including locating, moving, managing, and masking data. The mature version of today's simpler use case is a set of enterprise data management capabilities that control the flow of data to and from hundreds of different databases or flat files. This capability answers many of the most basic security questions we hear customers ask, such as “Where is my sensitive data?”, “Who is using it?” and “How can we effectively reduce the risks to that information?”.

Compliance is the second concept and the major reason users state for needing masking products. Unlike most of today's emerging security technologies, masking is not confined to a single sector: early customers came specifically from finance, but adoption is well distributed across different segments, particularly retail, telecommunications, health care, energy, education, and government. The diversity of customer requirements makes it difficult to pinpoint any one regulatory concern that stands out from the rest. In discussions we hear about all the usual suspects – including PCI, NERC, GLBA, FERPA and HIPAA – and in some cases multiple requirements at the same time. These days masking is deployed as a more generic control – to protect Personally Identifiable Information (PII), health records, and general customer records, among other concerns – and we no longer see every company focused on only one specific regulation or requirement. Masking is now perceived as addressing a general need to avoid unwanted data access, or to reduce exposure as part of an overall compliance posture.

The third concept is production database protection. While replacement of sensitive data – specifically through ETL-style deployments – is by far the dominant model, it is not the only way to protect data in a database. At some firms, protection of the production database is the primary goal of masking, with test data secondary. Masking can do both, which makes it attractive in these scenarios. Production data generally cannot be fully removed, so this model redirects requests to masked data where possible.

Within a masking solution, many masking methods can be used for data obfuscation, such as data shuffling, data substitution, randomizing, nullifying and encryption.
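
As a minimal sketch of what these methods do (illustrative only, not the project's masking implementation; encryption is omitted for brevity, and real deployments would add format preservation, referential integrity and key management), the following applies shuffling, substitution, randomizing and nullifying to a toy record set:

import random

# Toy records with made-up values, used only to demonstrate the methods.
records = [
    {"name": "Ana Kovac",   "msisdn": "385911234567", "revenue": 120.5},
    {"name": "Ivan Horvat", "msisdn": "385987654321", "revenue": 89.0},
]

def shuffle_column(rows, column):
    """Data shuffling: keep the real values but reassign them across rows."""
    values = [r[column] for r in rows]
    random.shuffle(values)
    for row, value in zip(rows, values):
        row[column] = value

def substitute(row, column, replacement="MASKED"):
    """Data substitution: replace the value with a fixed surrogate."""
    row[column] = replacement

def randomize(row, column, jitter=0.1):
    """Randomizing: perturb a numeric value while keeping it plausible."""
    row[column] *= 1 + random.uniform(-jitter, jitter)

def nullify(row, column):
    """Nullifying: drop the value entirely."""
    row[column] = None

shuffle_column(records, "msisdn")     # phone numbers swapped between rows
for r in records:
    substitute(r, "name")             # names replaced by a surrogate
    randomize(r, "revenue")           # revenue perturbed by up to +/-10%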

 

Asset: Fraud Detection and System Health Monitoring Use Case

The FERARI architecture has been successfully adapted to work with the data used for fraud detection at HT, namely the CDRs and alarms currently used by HT to detect fraud, and it has also been adapted to process data related to the network equipment in use at HT. Additionally, a GUI has been created to enable users to monitor the data processing and to drill down into the data. As such, the final FERARI architecture should enable easy integration into HT's fraud detection process.

 

Asset: Smart City Rijeka with IoT and Big Data

A smart city integrates multiple information and communication technology (ICT) solutions in a secure way to manage a city's assets. These assets include, but are not limited to, transportation systems, waste management, water management, safety systems, the information systems of local departments and other community services, and data management. The idea of the Smart City Rijeka project is to make traditional networks and services more efficient through digital and telecommunication technologies, with emphasis on the Internet of Things (IoT) and Big Data, in order to improve the quality of life of Rijeka's citizens and to benefit its businesses.

The technological lead of the Smart City Rijeka project is Ericsson Nikola Tesla; Hrvatski Telekom is its major and most important partner. Along with Hrvatski Telekom, there are dozens of other partners on the project, mostly small and medium-sized companies with a good reputation and a strong understanding of the local economy, as well as a few faculties and other educational institutions.

Under the umbrella of Smart City Rijeka, there are 8 interdependent projects, as follows:

  1. Smart Platforms
  2. 4D Intelligent infrastructure
  3. Smart Transport and Mobility
  4. IoT
  5. Big Data Services
  6. Tourism
  7. Energy Efficiency
  8. Smart Economy

Hrvatski Telekom is the project lead for project 5, Big Data Services, and takes part in the other projects as a partner providing consultancy services and business or technological support.

The scope of the fifth project is the development of a portal and mobile apps offering various data sets and analyses to citizens, public institutions and businesses in the City of Rijeka. HT wants to build and develop Big Data advanced analytics competencies and provide services for the Smart City.

The idea is to connect various data sources from the private and public sectors, perform various analyses, and make the data sets and analysis results available to citizens and businesses operating in the City of Rijeka. Big Data technology and tools play a vital role in the project, since there is a huge amount of data to be processed and analysed. A substantial amount of the data will require real-time or near-real-time processing. Moreover, the data sources are very heterogeneous, and the data has to be systematized, unified and adjusted for further use. The Big Data system will connect to various sensors spread throughout the City of Rijeka. Ericsson Nikola Tesla is developing an Internet of Things (IoT) platform which will serve as a backbone for the sensors. Sensors will be integrated into traffic lights, traffic signs, public transportation vehicles and phone booths, and will measure traffic intensity, air quality, pollution, etc. A data lake will store the data, and advanced analytics will be applied to pair data from the various sources and build knowledge, draw conclusions and make predictions on subjects such as route optimization and mobility, public safety, and innovative pricing models for marketing agencies and advertisers based on crowd and target-group movements. The IoT platform will be operated in a cloud provided by Hrvatski Telekom, which will also provide billing, provisioning, storage back-up and fault management for the platform. The IoT platform, with its inputs and outputs, is described in the picture below; magenta (pink/purple) boxes represent services directly provided by Hrvatski Telekom.

Data privacy and security are also important topics to be addressed. For that purpose, Hrvatski Telekom developed the Trust Center as a key enabler for data pseudonymization and anonymization. All personal data processed and analysed in the use case has to be pseudonymized.

To achieve pseudonymization, data streams from the various sources have to pass through the Trust Center before landing in the data lake.
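
A minimal sketch of such a pseudonymization step (illustrative only; the actual Trust Center design and key management are not reproduced here, and the field names below are assumptions) is a keyed one-way mapping applied to personal identifiers before records are written to the data lake:

import hmac
import hashlib

# Hypothetical secret held only by the Trust Center; with the key kept
# secret, the mapping is consistent (the same subscriber always maps to
# the same pseudonym) but cannot be reversed by data lake users.
TRUST_CENTER_KEY = b"replace-with-securely-managed-key"

def pseudonymize_id(identifier: str) -> str:
    """Return a stable pseudonym for a personal identifier."""
    digest = hmac.new(TRUST_CENTER_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:20]

def trust_center_pass(record: dict) -> dict:
    """Pseudonymize the personal fields of one stream record before it
    lands in the data lake (field names here are assumptions)."""
    masked = dict(record)
    for field in ("msisdn", "imsi"):
        if field in masked:
            masked[field] = pseudonymize_id(str(masked[field]))
    return masked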

 

Asset: FERARI Architecture in Call Anomaly Detection

The Big Data platform developed by T-Mobile Poland (TMPL) enabled HT to run a pilot implementation of smart network monitoring in order to detect anomalies in call performance that indirectly affect customer experience.

The pilot project revolved around “silent” calls. HT had been struggling with customer complaints for several months without successful problem resolution. Only a few systems are capable of decoding speech and analysing “silent” calls, and their implementation requires a very large investment. The TMPL Big Data platform instead uses a heuristic approach based on the analysis of user behaviour.

The input data for this use case were anonymized CDRs from the whole network; a total of 70 million call records were analysed. The analysis was executed on the Teradata Aster discovery platform.

The analysis showed that there is a significant number of “silent” call events in the HT network. A trend-based increase of “silent” calls in given cells can predict some network quality disturbances, and day-based peaks can indicate network problems.

The project discovered many cells with a significant daily average number of “silent” calls, which require additional analysis. Location-based reports showed differences in “silent” calls per region in Croatia.
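
As an illustration of the kind of per-cell, per-day analysis described above (a simplified sketch; the trial itself was executed on the Teradata Aster platform and its exact heuristics are not reproduced here), daily silent-call counts per cell can be compared against each cell's own baseline to flag suspicious peaks:

from collections import defaultdict
from statistics import mean, pstdev

def flag_silent_call_peaks(silent_call_events, min_days=7, sigma=3.0):
    """silent_call_events: iterable of (cell_id, date) tuples, one per
    detected 'silent' call. Flags (cell_id, date, count) triples whose daily
    count exceeds the cell's mean by `sigma` standard deviations."""
    daily = defaultdict(lambda: defaultdict(int))     # cell -> date -> count
    for cell_id, date in silent_call_events:
        daily[cell_id][date] += 1

    flagged = []
    for cell_id, per_day in daily.items():
        counts = list(per_day.values())
        if len(counts) < min_days:
            continue                                   # not enough history
        mu, sd = mean(counts), pstdev(counts)
        for date, count in per_day.items():
            if count > mu + sigma * sd:
                flagged.append((cell_id, date, count))
    return flagged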

 

Trial conclusions:

  • The TMPL trial results provide valuable, high-level information about the cells affected by silent calls;
  • More detailed analysis (daily trends and peaks, per day and per cell) could be provided, assuming a follow-up project is set up;
  • The next steps in optimization can already be undertaken; and
  • Based on the trial results, more detailed analyses and other use cases would be beneficial to HT.

Additional processing of the data showed that the affected cells are spread across the whole territory and that femto cells, tunnels, cities and the coast are more strongly affected (dropped calls combined). Because of this, it could be beneficial to implement the FERARI architecture in order to handle this data spread, i.e. information from all the cells in the country.

These potential use cases would have to handle vast amounts of data (CDRs, network information, etc.), and it would therefore be beneficial to implement the FERARI architecture to handle these types of events in a fast and efficient manner.