Performing Deduplication in a Real-Time Streaming Pipeline with Apache Beam Stateful Processing

Krishna Konar
6 min read · Jul 18, 2020

Apache Beam

Apache Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the Beam SDKs, you build a pipeline that can then run on a distributed processing back-end. When it comes to developing a complex streaming pipeline, Apache Beam is a pre-eminent choice.

The most common problem developers face in an Apache Beam streaming pipeline is duplication, and I ran into the same issue while developing my own pipeline. Most messaging services guarantee at-least-once delivery, but in most cases they do not guarantee that each message is delivered exactly once per subscriber.

How to Handle Duplicate Data in a Streaming Pipeline?

The most straightforward solution for this kind of stumbling block is to perform deduplication on the database end. For example, if you are writing streaming results to BigQuery, you can use row_number() over a window to ignore duplicate rows, which works reasonably well. A better solution, though, is to handle duplicates inside the streaming job itself, so you don't have to worry about redundant data on the database side.

In this blog we will focus on keeping duplicate data out of the streaming pipeline. I'm going to use Apache Beam's stateful processing to preserve per-key state and filter out duplicates.

Stateful processing with Apache Beam

Stateful processing in the Beam model expands the capabilities of Beam, unlocking new use cases and new efficiencies.

Stateful processing occurs in the context of a single key: all of the elements are key-value pairs sharing the same key, and state is maintained per key. With stateful processing you can preserve the state of each key; every element passes through the stateful DoFn, and the state associated with its key is carried from one element to the next.

Timely processing complements stateful processing in Beam by letting you set timers to request a (stateful) callback at some point in the future. Stateful and timely computation is the low-level pattern that underlies Beam's higher-level computational patterns; precisely because it is lower level, it lets you micromanage your computations to unlock new use cases and new efficiencies.

Implementation

To implement this solution we will use Cloud Pub/Sub as the real-time messaging service and Apache Beam to build the streaming pipeline.

First we will read data from the Pub/Sub subscription and deserialize the messages into the required format inside a ParDo. Simply use the PubsubIO API to pull messages from the Pub/Sub subscription, as in the sketch below.
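A minimal sketch of the read step with the Beam Java SDK. The project and subscription names are placeholders, and the surrounding class is just scaffolding, not the exact pipeline from this post:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.values.PCollection;

public class DedupPipeline {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline pipeline = Pipeline.create(options);

    // Pull raw messages from the Pub/Sub subscription as strings.
    // The subscription path is a placeholder for your own subscription.
    PCollection<String> messages = pipeline.apply("ReadFromPubSub",
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"));

    // ... keying, deduplication, and output steps go here ...

    pipeline.run();
  }
}
```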

Next, convert your input to key-value format. Since a stateful DoFn preserves state per key, it's imperative to provide the input as a PCollection of KV. Note: you can use a single field, a combination of fields, or the complete message as the key; it depends entirely on the use case. A keying sketch is shown below.
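A minimal keying sketch, assuming the messages are plain strings; the composite-key example in the comment is only illustrative:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Key each message. Here the entire message is used as the key (as in the demo);
// in practice you would usually build the key from specific fields,
// e.g. customer_id + "#" + status.
PCollection<KV<String, String>> keyed = messages.apply("ToKV",
    MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .via(msg -> KV.of(msg, msg)));
```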

Inside the stateful DoFn we check whether the key has been observed before. If it hasn't, the DoFn records the key in state and outputs that record. If the key's state shows it has already been observed, the DoFn ignores the record and treats it as a duplicate. You can either discard the duplicate or send it to a dead-letter tuple tag and store it in GCS or another storage service; a sketch of such a DoFn follows.
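A minimal sketch of such a stateful deduplication DoFn. The class name, the state and timer ids, and the 24-hour expiry are illustrative choices rather than the exact code behind this pipeline:

```java
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Illustrative stateful deduplication DoFn: emits a record the first time
// its key is seen and drops later occurrences until the state expires.
public class DedupFn extends DoFn<KV<String, String>, String> {

  // How long to remember a key (illustrative value; tune for your use case).
  private static final Duration EXPIRY = Duration.standardHours(24);

  // Per-key flag marking whether this key has already been observed.
  @StateId("seen")
  private final StateSpec<ValueState<Boolean>> seenSpec = StateSpecs.value();

  // Timer used to clear the per-key state after EXPIRY.
  @TimerId("expiry")
  private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("seen") ValueState<Boolean> seen,
      @TimerId("expiry") Timer expiry) {
    if (seen.read() == null) {
      // First occurrence of this key: remember it, arm the expiry timer, emit.
      seen.write(true);
      expiry.offset(EXPIRY).setRelative();
      c.output(c.element().getValue());
    }
    // Otherwise the key was already seen: treat the record as a duplicate
    // (drop it here, or route it to a dead-letter output if you prefer).
  }

  @OnTimer("expiry")
  public void onExpiry(@StateId("seen") ValueState<Boolean> seen) {
    // Retention period over: forget the key so it can be emitted again.
    seen.clear();
  }
}
```

It is applied to the keyed PCollection from the previous sketch with ParDo.of(new DedupFn()).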

With a TimerSpec you can preserve the key state for a suitable amount of time. The stateful DoFn will hold the key for the specified duration, preventing duplicates within that period. Once the timer fires, we clear the key state, as in the onExpiry callback of the sketch above.

Demo

Step 1: Run the Dataflow streaming job. You can either run the pipeline with a Maven command or run the job directly from any IDE. Link

I ran my Beam job directly from Eclipse. You can see the running job in the Dataflow console, as shown below.

Step 2: Publish messages to the Pub/Sub topic. Normally streaming data arrives in JSON format, so the first step would be to deserialize it, run validation checks, and convert it into a suitable PCollection for further processing in the Apache Beam pipeline. For this demo I'm going to use simple text as the input data and perform deduplication on the entire text. Instead of using the entire text as the key, I'd suggest using a combination of specific fields, e.g. customer_id and status. For the demo I'm manually publishing the messages below to the Pub/Sub topic. In a real scenario you would receive IoT data or real-time analytics data such as game stats, sensor readings, etc.

Step 3: We have an Input Data Check transformation in the Dataflow pipeline to validate the input data. I have added a logger in this step to display the input message; a sketch of such a transformation follows.
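A validation-and-logging step of this kind might look like the following sketch; the class name and the emptiness check are illustrative, and the real checks may differ:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative input-check DoFn: logs each incoming message and passes
// non-empty messages downstream.
public class InputDataCheckFn extends DoFn<String, String> {
  private static final Logger LOG = LoggerFactory.getLogger(InputDataCheckFn.class);

  @ProcessElement
  public void processElement(ProcessContext c) {
    String msg = c.element();
    LOG.info("Input message: {}", msg);
    if (msg != null && !msg.trim().isEmpty()) {
      c.output(msg);
    }
  }
}
```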

In the next step we convert the input to KV pairs and apply the deduplication DoFn to it. For this demo I used the entire text as the key, so the state of the entire text is preserved and deduplication applies to the whole message. I published 2 messages, and since both were unique, the ParDo output both of them.

Step 4: We publish the same messages again to see how the deduplication logic behaves inside the Cloud Dataflow streaming job. I published 4 messages to the Cloud Pub/Sub topic, 2 of which were duplicates. You can see all the messages in the log section.

As you can see in the snapshot below, the deduplication DoFn accepted 4 messages as input and output only the 2 unique ones, discarding the duplicates. I wrote the logic to drop duplicate messages, but you could also route them to a dead-letter tag for further analysis; it depends entirely on the use case.

That’s it!!

What we saw today was just scratching the surface. There is so much more to be learned with Beam.

What we know is a drop. What we don’t know is an ocean. Same is applicable for Apache Beam, there is so much to explore.

Thank you so much for reading. I hope you learned something valuable, and I'll see you in the next one.


Krishna Konar

Professional Data Engineer with expertise in Big Data technologies and GCP Cloud.