Confluent Platform

Confluent Platform is a stream data platform that enables you to organize and manage the massive amounts of data that arrive every second at the doorstep of modern organizations across a wide range of industries: retail, logistics, manufacturing, financial services, and online social networking.

The Confluent Platform is a collection of infrastructure services, tools, and guidelines for making all of your data readily available as realtime streams. By integrating data from disparate IT systems into a single central stream data platform, a "nervous system" for the company, the Confluent Platform lets you focus on how to derive business value from your data rather than worrying about the underlying mechanics of how data is shuttled, shuffled, switched, and sorted between the various systems:

[Figure: Confluent Platform overview (confluent-platform.png)]

At its core, the Confluent Platform leverages Apache Kafka, a proven open source technology created by the founders of Confluent while at LinkedIn. Kafka acts as a realtime, fault-tolerant, highly scalable messaging system. Its key strength is its ability to make high-volume data available as a realtime stream for consumption in systems with very different requirements – from batch systems like Hadoop, to realtime systems that require low-latency access, to stream processing engines that transform data streams as they arrive.
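As a minimal sketch of what this pub-sub model looks like in practice, the snippet below publishes a record with the standard Java producer client; the broker address (localhost:9092), topic name ("page-views"), and record contents are illustrative assumptions rather than anything prescribed by the platform.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Hypothetical broker address; point this at your own cluster.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The record is appended to a partition of the topic and becomes available
                // to any number of independent consumers: batch jobs, low-latency services,
                // or stream processing engines.
                producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
            }
        }
    }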

Out of the box, the Confluent Platform also includes a Schema Registry, a REST Proxy, and integration with Camus, a MapReduce-based pipeline that dramatically eases the continuous loading of data from Kafka into Hadoop clusters.

More detailed information about Apache Kafka is available in the Apache Kafka documentation.

Schema Registry: one of the most difficult challenges with loosely coupled systems is ensuring compatibility of data and code as the system grows and evolves. With a messaging service like Kafka, services that interact with each other must agree on a common format, called a schema, for messages. As requirements change, it becomes necessary to evolve these formats. The Schema Registry enables safe, zero-downtime evolution of schemas by centralizing the management of schemas written for the Avro serialization system. It tracks all versions of schemas used for every topic in Kafka and only allows schemas to evolve according to user-defined compatibility settings. This gives developers confidence that they can safely modify schemas as necessary without worrying that doing so will break a different service they may not even be aware of. The Schema Registry also includes plugins for Kafka clients that handle schema storage and retrieval for Kafka messages sent in the Avro format. The integration is seamless: if you already use Kafka with Avro data, using the Schema Registry only requires including the serializers with your application and changing one setting.
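As a hedged sketch of what "including the serializers and changing one setting" means in practice, the producer below swaps in the Confluent Avro serializer and points it at a Schema Registry instance; the registry URL (http://localhost:8081), topic name, and record schema are illustrative assumptions.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class AvroProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // The Avro serializer registers and looks up schemas in the Schema Registry
            // automatically when it serializes a record.
            props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
            // The one extra setting: where the Schema Registry lives (assumed address).
            props.put("schema.registry.url", "http://localhost:8081");

            // Illustrative Avro schema; any record type is handled the same way.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
                + "{\"name\":\"user\",\"type\":\"string\"},"
                + "{\"name\":\"page\",\"type\":\"string\"}]}");
            GenericRecord record = new GenericData.Record(schema);
            record.put("user", "user-42");
            record.put("page", "/index.html");

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("page-views", record));
            }
        }
    }

If a producer later attempts to register a schema change that violates the topic's compatibility setting, registration is rejected, which is how incompatible data is kept out of the pipeline.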

REST Proxy: many organizations use languages that do not have high-quality Kafka clients. Only a handful of languages have very good client support, because Kafka's very general, flexible pub-sub model makes writing a high-performance client much more challenging than for many other systems. The REST Proxy makes it easy to work with Kafka from any language by providing a RESTful HTTP service for interacting with Kafka clusters. It supports all of the core functionality: sending messages to Kafka; reading messages, both individually and as part of a consumer group; and inspecting cluster metadata, such as the list of topics and their settings. This way you get the full benefits of the high-quality, officially maintained Java clients from any language. The REST Proxy also integrates with the Schema Registry: it can read and write Avro data, registering and looking up schemas in the Schema Registry. Since it automatically translates JSON data to and from Avro, you can get the full benefits of centralized schema management from any language using only HTTP and JSON.
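The sketch below shows one way to produce a JSON message through the REST Proxy using nothing but the Java standard library's HTTP support; the proxy address (localhost:8082), topic name, and embedded JSON content type are assumptions based on the REST Proxy's v1 API, so check them against your deployment.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RestProxySketch {
        public static void main(String[] args) throws Exception {
            // Assumed REST Proxy address and topic name.
            URL url = new URL("http://localhost:8082/topics/page-views");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            // Embedded-format content type for plain JSON messages (v1 API).
            conn.setRequestProperty("Content-Type", "application/vnd.kafka.json.v1+json");
            conn.setDoOutput(true);

            // One or more records wrapped in a "records" array.
            String body = "{\"records\":[{\"value\":{\"user\":\"user-42\",\"page\":\"/index.html\"}}]}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Response code: " + conn.getResponseCode());
        }
    }

Consuming messages and inspecting cluster metadata follow the same pattern: plain HTTP requests against proxy endpoints, with Avro payloads expressed as JSON when the Schema Registry integration is used.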

Camus: a MapReduce job that provides automatic, zero-data-loss ETL (extract, transform, load) from Kafka into HDFS. By running Camus periodically, developers can be sure that all the data stored in Kafka has also been delivered to the data warehouse in a convenient time-partitioned format and is ready for offline batch processing. Camus is also integrated with the Schema Registry: it automatically decodes the data before storing it in HDFS and ensures it is in a consistent format for each time partition, even if the input contains records written with different schemas. By integrating the Schema Registry at every step from data creation to delivery into the data warehouse, developers can avoid the expensive, labor-intensive pre-processing often required to get data into a usable state.
