Closer to Real-Time Kafka Processing

Recently I had to ponder how to do processing for a security system with stringent performance requirements. Like a lot of folks these days, my first thought was to use Kafka. Kafka’s features are well known and don’t need repeating here.

There was, however, a red flag. As I observed the various systems and applications using Kafka, most of them centered on backend processing with no real-time or near real-time performance requirements. This lack of near real-time applications made me somewhat uneasy. I didn’t want to commit to Kafka only to find that it could not scale or be extremely responsive. In particular, security requests needed to be answered in milliseconds.

So what to do?

Eventually I came up with an approach…

Caveat: I don’t know if the approach described here is novel or not. It could be novel, or it could be rediscovery…In any case, here is the method, starting with its inspiration.

Now a lot of systems these days use the following pattern. Nodes in a cluster propagate information horizontally: node X replicates data to node Y over some fast communication channel, and then both node X and node Y persist the data. Sometimes the data from a node is replicated to several nodes, and each node then persists the data to its own persistent store.
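The pattern can be sketched in a few lines. This is a minimal illustration, not a real cluster: the `Node` class and its methods are hypothetical, the peer list stands in for the fast communication channel, and a plain list stands in for each node’s persistent store.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of horizontal-then-vertical replication: a record is first
// replicated across peers (horizontal), then each node persists its
// own copy (vertical). All names here are invented for illustration.
class Node {
    private final List<String> log = new ArrayList<>();       // in-memory log
    private final List<String> persisted = new ArrayList<>(); // stand-in for disk
    private final List<Node> peers = new ArrayList<>();

    void addPeer(Node peer) { peers.add(peer); }

    // Accept a record: replicate to peers first, then persist locally.
    void accept(String record) {
        log.add(record);
        for (Node peer : peers) {
            peer.replicate(record); // horizontal step (fast channel)
        }
        persist(record);            // vertical step (slow, durable)
    }

    // Called by a peer during horizontal replication; the peer persists too.
    void replicate(String record) {
        log.add(record);
        persist(record);
    }

    private void persist(String record) {
        persisted.add(record); // a real node would write to its own store
    }

    int persistedCount() { return persisted.size(); }
}
```

The point of the ordering is that once the cheap horizontal step completes, the record survives a single-node failure even before any disk write finishes.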

This design pattern is responsible for a lot of the performance in some NoSQL systems.

It occurred to me that this design pattern could be expanded upon. That is, suppose a node X has a Kafka log. As messages are appended to the log on node X, the appends are also pushed to related node(s) that have replicated the same Kafka log. Only after this horizontal replication completes is the log persisted.

[Of course, it is possible to not persist the logs at all, but in many preferred cases, the logs would be persisted.]

Once the Kafka log has been replicated, processors can then process the messages. Other data structures, such as a Hazelcast map or a Bloom filter, may then be used to determine whether a message has been processed or not. (This is important for messages that are not idempotent.)
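The duplicate check could look something like the following sketch. The `HashSet` here stands in for the shared structure (a Hazelcast map, a Bloom filter, etc.); the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Sketch of guarding non-idempotent processing: a handler runs only
// the first time a given message id is seen. The Set stands in for a
// shared structure such as a Hazelcast map or Bloom filter.
class DedupProcessor {
    private final Set<String> seen = new HashSet<>();

    // Returns true if the handler ran, false if the message was a duplicate.
    boolean process(String messageId, String payload, Consumer<String> handler) {
        if (!seen.add(messageId)) {
            return false; // already processed; skip the non-idempotent work
        }
        handler.accept(payload);
        return true;
    }
}
```

One caveat if a Bloom filter is substituted for the set: Bloom filters can report false positives, which here would mean occasionally skipping a message that was never actually processed, so an exact structure is needed when that is unacceptable.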

Anyway, that’s the core of the idea. Take a well known design pattern and evolve it in a (potentially) novel way so that the Kafka logs can be processed more quickly, and with great resilience, due to the horizontal-then-vertical replication strategy.

Again, I do not know if Kafka does anything like this currently. I am fairly new to Kafka, but as far as I can tell, it does *not* do this.





Restructuring Code

When I attacked the problem of large-scale text classification, my focus was always on solving certain core problems. That last sentence is more than a little defensive. Why? Well, I did solve those core problems, but there were other things I did not attend to.

In particular, while the Topic Scout implementation has many Java modules, I did not anticipate the need to separate the runtime aspects of Topic Scout from its discovery aspects. What does this mean?

Well, the true heavy lifting of Topic Scouting is discovering words and phrases relevant to a topic, and doing so in a way that this relevance stands up when there are many thousands of topics, if not hundreds of thousands. What I did not anticipate was the need to separate out the runtime, where text is analyzed to determine its topic, and nothing more.

Now, I would like to release software that would allow users to classify documents by topic, but I do not want to release the software that actually does the data mining.

Unfortunately, isolating the runtime is much harder than I’d like. What I need is a software tool that can help me restructure the code.

In particular, suppose I have N modules, and I want to release one of these modules while minimizing dependencies on the others. Well, given the current inter-dependencies of the code, I can’t simply add dependencies on the other modules. Why? Those dependencies would quickly suck in all, or almost all, of the other modules, including the proprietary modules relating to data mining.


What I actually need is an interactive tool that would help me determine which classes need to be added, but would also warn me, for each class, how many additional classes would be “sucked in” by the transitive closure of that dependency.
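The warning such a tool would give amounts to a reachability count over the class dependency graph. Here is a rough sketch of that core computation; the graph is supplied by hand and the names are invented, whereas a real tool would build the graph from bytecode or source analysis.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the "how much gets sucked in" check: for a candidate class,
// compute every class reachable through the transitive closure of its
// direct dependencies. A real tool would derive the graph from bytecode.
class DependencyCloser {
    private final Map<String, List<String>> deps; // class -> direct dependencies

    DependencyCloser(Map<String, List<String>> deps) { this.deps = deps; }

    // All classes reachable from 'start', excluding 'start' itself.
    Set<String> closure(String start) {
        Set<String> reached = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(deps.getOrDefault(start, List.of()));
        while (!todo.isEmpty()) {
            String c = todo.pop();
            if (reached.add(c)) {
                todo.addAll(deps.getOrDefault(c, List.of()));
            }
        }
        return reached;
    }
}
```

An interactive tool would rerun this count as candidate classes are added, so you can see immediately when one innocent-looking dependency is about to drag in a proprietary module.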

I have no such tool. So instead, I am doing this manually. It is time consuming.

I did look at one restructuring tool, but it offered restructuring based on class coherence in the existing code base. I didn’t pursue that tool. Why? Well, the current coherence is wrong. Too many entanglements. I don’t want a tool that just analyzes the current class coherence; I want a tool that understands that I want to isolate a module, and helps me do this *despite* the current class coherence.

Instead, I am discovering dependencies, and initially creating duplicate classes, one by one, where necessary. My hope is that these duplicate classes will form a new shared module X, such that my runtime can depend on module X, and my data mining modules can also depend on module X, and in this way I can separate out the runtime.

Hopefully this will work out!



Securing data when there is a mix of structured and unstructured information

I recently attended a meeting on securing data using classifications.

The open source project involved is Apache Ranger. Unfortunately, the Apache Ranger architecture, at present, is quite limited. It doesn’t support restrictions on queries (e.g. only show rows for the Accounting department) and so is of very limited use. Maybe someday Apache Ranger may evolve into something more…

But going to this talk did get me thinking about how hard it is to control access to data these days. So many companies need to limit access to sensitive information, but sensitive information doesn’t just live in fields and columns; it lives in documents and in text. So what does it matter if you prevent access to an SSN column in a table if an e-mail related to that user contains the user’s social security number?

Of course, it is very challenging to actually detect – without lots of false positives – a social security number in a document. So I am not saying that guarding this unstructured information is easy.
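A tiny sketch makes the false-positive problem concrete. The pattern below (invented for illustration) matches the 3-2-4 digit shape of an SSN, but plenty of innocent nine-digit strings share that shape.

```java
import java.util.regex.Pattern;

// Sketch of naive SSN detection in free text: match the 3-2-4 digit
// shape, with optional dashes. Deliberately simple, to show why such
// scanning produces false positives.
class SsnScanner {
    private static final Pattern SSN_SHAPE =
        Pattern.compile("\\b\\d{3}-?\\d{2}-?\\d{4}\\b");

    static boolean looksLikeSsn(String text) {
        return SSN_SHAPE.matcher(text).find();
    }
}
```

A bare nine-digit order number matches this shape just as well as a real SSN does, which is exactly the kind of false positive that makes guarding unstructured text hard.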

What I am saying is that software – commercial or not – that claims to protect sensitive information may need to acknowledge that sensitive data embedded in unstructured content is a weak link.




