Is event streaming or batch processing more efficient in data processing? Is an IoT system the same as a data analytics system, and a fast data system the same as a big data system? These questions stem from two philosophies of data analytics architecture: state orientation and event orientation. This post will reflect on both concepts and their evolution, as well as highlight the most recent event streaming capabilities found in ksqlDB.
In 1887, Dutch paleoanthropologist Eugène Dubois found the fossil remains of “Java Man” (Anthropopithecus erectus) in Indonesia. He soon proclaimed that he had found the missing link, the evolutionary species that served as the connection between apes and Homo sapiens. Later, new discoveries appeared such as “Lucy” (Australopithecus afarensis)—also thought to be the longed-for link at one point—but the idea of a linear evolution of the human species was eventually abandoned during the 20th century. It was replaced by the theory of a branching tree evolution, where development was believed to be diversified into different species with similar characteristics that coexisted over time.
Within data analysis technologies, also subject to continuous evolution, we can distinguish two main currents that have inspired data architecture: state orientation and event orientation.
Under the first paradigm is the classic three-layer architecture: service-oriented architecture (SOA), microservices, and traditional business intelligence (BI) systems, in which data is exposed in the form of services and BI universes or operational APIs, which are consumed by applications or analysts. In this variant, data is requested (pulled) to know the state of an element or set of elements at a given time.
Under the second paradigm is event-driven architecture (EDA), or fast data, as well as the centralized architectures of IoT and streaming analytics. In this case, the data is delivered (pushed) in the form of events that report changes in state. Jerry Mathew explains these concepts in detail.
The technological implementation of state orientation versus event orientation varies.
In the case of state orientation, it is based on a database system that acts as a source of truth. This system’s data is exposed to applications through synchronous calls or queries via SQL or an API.
With event orientation, fundamentally the event manager or queue manager allows the applications to store and distribute events in an asynchronous way. In the IoT world, there are also local Data Distribution Service (DDS) technologies that allow for peer-to-peer networks to distribute events, which are very useful in applications that want to avoid centralized dependencies.
If the application requires state orientation and event orientation simultaneously, it is necessary to implement both in the architecture. Any modern IoT or big data platform typically incorporates elements for both event streaming and batch processing with push and pull query mechanisms, at the expense of introducing complexity into the architecture.
This section will focus on the evolution of event streaming technologies and data analytics into new intermediate architectures, which adopt elements of both event streaming and batch processing.
Similar to what some believe are the first pre-hominids like Ardipithecus ramidus, which maintained the characteristics of an ape but began to adopt the first human traits, databases are beginning to develop capabilities for reporting events and vice versa.
As we continue on this evolutionary path, Confluent, founded by the original co-creators of Apache Kafka®, introduced a new database for event streaming called ksqlDB. This technology turns the well-known streaming SQL engine into a powerful event streaming database with real-time querying functions.
Imagine, for example, an IoT application used to collect changes in the state of a vehicle’s sensors. ksqlDB allows, by means of SQL code, the creation of stream-type tables that register in an immutable way a log of all the events. In addition, it incorporates numerous connectors for automatic synchronization with very diverse sources.
But if we want to know the status of the sensors at a given time, ksqlDB can materialize the previous stream in a view or table, in which the key is the ID of each sensor. Thus, the most up-to-date sensor value appears in different columns. In this way, the application can obtain data via SQL using a classic pull query to know the status of a sensor:
SELECT instant_velocity, current_latitude, current_longitude FROM sensor_assets WHERE vehicleid = '19012015' AND sensorid = "01031975";
But it is also possible to launch the query to continuously subscribe to status changes by simply adding the EMIT CHANGES statement:
SELECT instant_velocity, current_latitude, current_longitude FROM sensor_assets WHERE vehicleid = '19012015' AND sensorid = "01031975" EMIT CHANGES;
In this way, the application continuously receives the events of state changes, which allows for alerting or application of a machine learning model.
ksqlDB also enables high performance, scalability, high availability, fault tolerance, and low latencies. Is ksqlDB just another pre-hominid in our technological evolution, or is it the real missing link between big data and fast data systems?
Open source innovation is accelerating exponentially, and ksqlDB is an ongoing effort to simplify your architecture.
Ready to check ksqlDB out? Head over to ksqldb.io to get started, where you can follow the quick start, read the docs, and learn more!
A version of this post was originally published by Guillermo Gavilán on the Empresas blog.
Building a headless data architecture requires us to identify the work we’re already doing deep inside our data analytics plane, and shift it to the left. Learn the specifics in this blog.
A headless data architecture means no longer having to coordinate multiple copies of data, and being free to use whatever processing or query engine is most suitable for the job. This blog details how it works.