Publishing real-time financial data feeds using Kafka

Good morning and welcome to this session on publishing real-time financial data feeds using Kafka. If you're a data feed provider, you may already have customers who are asking you to deliver your feed directly on AWS. And by the time we end this session, you should have a pretty good understanding of how to do that.

My name is Rana. I am a Principal Solutions Architect at AWS. I am joined today by Diego. He's a Senior Solutions Architect at AWS. Both of us are based in New York City and both of us specialize in topics that are related to the financial services industry.

Thanks for being here on a Wednesday morning.

All right. So here's what we've got lined up for this session:

  • We'll first quickly go through the different types of financial data streams that are out there
  • And then we'll get into who is using Kafka today and why is Kafka gaining so much traction within the financial services industry?
  • After that, we'll talk about Amazon Managed Streaming for Apache Kafka, or MSK, which is the service that we built to make deploying Kafka easier on AWS.
  • And then we'll get into the meat of it. I'll hand it over to Diego and he's gonna talk about how you can use MSK to publish your real time financial data feed on AWS and also how you can monetize it as well.
  • After that, we'll get into the security model - like how are you going to authenticate and authorize your clients using Kafka?
  • And finally, we'll end with some operational aspects - like for example, how would you monitor what's going on and how would you automatically deploy your solution?

All right. So what are the different types of financial data streams that are out there?

When most people think about financial data streams, they think about stock quotes. It was back in the late 1860s that the ticker tape was invented. It basically used telegraph wires to transmit stock quotes and print them on a narrow strip of paper called the ticker tape, and that technology, believe it or not, lasted decades. It wasn't until the mid 1960s that the ticker tape became electronic. But even back in that era, these quotes were delayed, something like 15 or 20 minutes.

Nowadays, of course, everybody expects to log into their brokerage account and get quotes delivered in real time, but it's not just stock prices that get delivered in these financial data streams. You also have prices of options and commodities and crypto assets, and this list is by no means comprehensive.

Another type of financial data feed is one that carries corporate actions and company news items. So for example, if a company announces a merger or an acquisition - they're gonna buy another company - or if a company like Amazon announces a stock split, these are all events that investors are interested in. So this is yet again information that can be conveyed within a financial data stream.

And then there's a third type of data stream, which I find particularly interesting, and that's the one that has analytics embedded within it. So for example, let's say that you have a raw data feed of stock quotes - you can overlay your own intelligence on top of it. You can take a stock price and mark it with an indicator that says, OK, this price has just crossed over a 200 day moving average, for example. If you can do something like that, now you've created an enriched data feed, something that your customers are gonna find valuable. It differentiates you from other data feed providers, and it's something that you can then monetize.

All right. So what are the use cases for some of these different financial data streams?

Well, the most obvious one that comes to mind is trading, and we'll talk about when it makes sense to put trading applications on AWS versus putting them at a data center that's located closer to the exchange.

A second type of use case is surveillance and fraud detection. This is where you might have a client who's in the regulatory business, and they're interested in making sure that nobody is trading on insider information - that is, trading on information that's not public. So that's another important use case.

A third use case is risk assessment. This is where a financial institution will hire risk analysts whose job is to make sure that none of the traders are putting on trades that are going to put the entire company at risk. A situation like this happened in the late 1990s, during the Asian financial crisis, when a hedge fund called Long Term Capital Management went under because some of the traders there put on some trades that were particularly risky. That was an extremely expensive event, because not only did investors lose a lot of money, it cost the government some money as well. It's certainly something you don't want to see repeated, and therefore this becomes yet another use case for looking at financial data streams.

A fourth use case for financial data streams is post-trade analytics and back testing. This is where a financial institution will hire a quantitative analyst whose job is to create a trading model with back testing, where the model might have some factors, like price-earnings ratios, or six-month-momentum type criteria. The model is trying to deliver some sort of alpha - some sort of an edge over the market. So again, this is another important use case for financial data streams.

Now, of course, when you look at applications like trading, you have to consider network latency. I'm gonna talk about where it makes sense to have trading applications hosted on AWS versus where it might make more sense to host them closer to an exchange.

There are all kinds of scenarios. Let's say that you're a financial data feed provider and your customer is a brokerage or a fund. Now, if your customer is engaged mainly in high-touch retail trading, or they're trading mainly on fundamentals, then they don't necessarily need low network latency to the exchange. These, then, are ideal use cases for consuming market data in the cloud, and that's what we're gonna be focusing on.

And then there are other use cases where you might have customers who insist on low network latency. This is where the latency from the point at which the trade gets executed to where the data is available is in the range of a few milliseconds, or ideally a millisecond, of network latency. This is the domain of low latency, and it typically needs an architecture that's more hybrid, where some component of the architecture is geographically closer to the financial data exchange.

And then you have the domain of ultra-low latency. This is where your customers might be asking for sub-millisecond network latency from the point that the trade gets executed to the point that the data becomes available.

So let's look into this in a little bit more detail. This is how market data gets distributed in a traditional manner. On the left-hand side of the diagram, you've got the exchange data center, and there's usually something like a ticker plant and a gateway, and that's delivering the market data using multicast to a consumer.

In this particular case, if you look at the blue rectangles: if the consumer is engaged in activity like high-frequency trading or algorithmic trading and they need that sub-millisecond latency, then the consumer needs to be co-located right at the exchange. Co-located right at the exchange, they're gonna get that sub-millisecond latency.

But of course, if they're located there, that's pretty expensive - renting space at the exchange's data center. And not only is it expensive, it's kind of complicated, because you're having to consume data that's coming off a multicast router, so you're gonna have to handle all that multicast complexity.

So what's the alternative? Well, if you're willing to tolerate a few more milliseconds of network latency, then you can have your trading applications at a data center that's still close enough to the exchange that the network latency is in the few-millisecond range. In this diagram, we're showing that data center on the bottom right, separated from the exchange by a single fiber hop, or it could be an MPLS network.

Now, one thing that has become available recently is AWS Local Zones. A Local Zone is an alternative to hosting your trading applications at a data center close to the exchange, because with a Local Zone you've got a lot of the core features that you expect from a regular in-region Availability Zone. For example, you get EC2 instances on demand, you can run containers on demand, and you get the familiar IAM security underpinnings.

So now you have an AWS cloud environment that's close enough - the New York Local Zone is close enough to the Nasdaq data center and the NYSE data center - that you're going to meet your low-latency requirements by hosting the trading applications over there, because Direct Connect has become available and you can connect via Direct Connect from the New York Local Zone right into the exchange.

But the focus of this talk is not necessarily on trading. It's more about how you make the market data available to clients who don't necessarily need this type of low latency. I already mentioned some of the use cases: use cases like post-trade analytics or surveillance don't necessarily need low latency. These, then, are ideal use cases for publishing your market data feeds directly on AWS, and that's what we're gonna focus on in this talk.

So how do you do that? The architecture that's shown here is an example of how you could accomplish this. On the left-hand side, you could still have your market data originating from a data center that's close to the financial data exchange. You could stream that financial data through a Kafka client and deliver it to a regular AWS region, like for example us-east-1.

So in the middle of the diagram, you've got a region like us-east-1 hosting a Kafka cluster which is managed by Amazon MSK - Managed Streaming for Apache Kafka - and I'm going to talk about what that is in a couple of minutes. As a result of being able to stream your market data directly from AWS, you can now make that data available to customers who are not only on AWS but also outside AWS.

What this diagram is showing on the right-hand side is a customer who's already on AWS. And you'll notice that your customer can consume your data feed using a regular Kafka client, through a VPC endpoint, using this PrivateLink technology.

So AWS PrivateLink is a technology that enables data from your account to flow to your customer's account within the AWS private network, without ever going over the public internet. PrivateLink is a secure means of delivering this data while staying within the AWS network.

You could say that PrivateLink is actually enabling a SaaS solution, because your customer is in a different account - SaaS being Software as a Service. But what we're really talking about here is real-time data streaming as a service: PrivateLink enabling real-time data streaming as a service.

Of course, it's not enough just to publish your data to your customers - you wanna make money off of that too. How do you do that? The easiest way is to integrate AWS Marketplace into your application. We'll get into this in a bit more detail, but Marketplace is a way to have your application exposed to a broad array of companies who might be interested in purchasing your solution. And Marketplace comes with quick starts and registration-type applications that can help you enroll your users quickly.

Now, who are some of the real-time financial data feed providers who are already on AWS?

You've got the big exchanges, like Nasdaq, already streaming data on AWS. You also have the large market data feed providers like Refinitiv and Bloomberg. Customers of Bloomberg have actually been able to consume Bloomberg data through their B-PIPE API on AWS for three years now.

Here's a quote from Lauren Dillard, who's the EVP and Head of Global Information Services at Nasdaq. She says: "Nasdaq Cloud Data Service is a significant advancement in the financial data space as it uses cloud to stream important real time market data tailored to our clients' specific needs. Adding the cloud to the data equation through our collaboration with AWS is a big win for investors."

All right. So what is Apache Kafka? And why is it gaining so much traction within the financial services industry?

Well, first of all, Kafka is an open source distributed event streaming platform. So let's unpack what that means. Open source, of course, means there are no license fees - the team at LinkedIn who built Kafka open sourced it back around 2012. And it's a distributed event streaming platform, so the message delivery mechanism - the implementation of a Kafka cluster - happens across multiple parallel nodes. It's that parallelism that enables a Kafka cluster to deliver your financial data stream with high throughput and low latency.

As an example of this, a couple of years ago the team at Apache Kafka ran a benchmark. They used three i3en instances on AWS, and with that three-node cluster, they were able to push about 600 megabytes per second of data through the cluster.

What's remarkable is that it's not just the 600 megabytes per second through a three-node cluster. It's the fact that the latency from the producer to the consumer in that experiment pretty much stayed at five milliseconds or less 99% of the time.

So Kafka delivers high throughput with low latency, and with low jitter as well - jitter being the variability of the latency, which is extremely low. These are all characteristics that you want when you're streaming your financial data: you want the high throughput, you want the low latency, and you want extremely low variability within that low latency as well.

Another characteristic that makes Apache Kafka superior to other message streaming services for financial data feeds is the publish-and-subscribe structure of topics and partitions. With Kafka, you have multiple topics that your customers can subscribe to. As an example, you can have stock quotes being one topic and commodities prices being a separate topic.

But within a topic, you also have partitions. So if your customers are interested in subscribing to a particular ticker - like QQQ, for example - then you can map that ticker to a partition, and thereby the consumers can efficiently read the data from the cluster.

So it's this two level scheme of topics and partitions that makes Kafka ideally suited for financial data streams. Ok?
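As a concrete illustration of mapping tickers to partitions, here is a minimal sketch using the kafka-python library; the broker address and topic name are hypothetical, and the TLS certificate options are omitted for brevity. Because Kafka's default partitioner hashes the message key, every quote for a given ticker lands in the same partition and stays in order.

```python
import json
from kafka import KafkaProducer

# Hypothetical MSK broker address and topic name, for illustration only.
producer = KafkaProducer(
    bootstrap_servers=["b-1.example-cluster.abc123.c2.kafka.us-east-1.amazonaws.com:9094"],
    security_protocol="SSL",  # certificate settings omitted for brevity
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

quote = {"ticker": "QQQ", "bid": 389.10, "ask": 389.12}

# Keying by ticker means the default partitioner hashes "QQQ" to a fixed
# partition, so every QQQ quote is written to, and read back from, the
# same partition in order.
producer.send("stock-quotes", key=quote["ticker"], value=quote)
producer.flush()
```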

So what is Amazon MSK - Managed Streaming for Apache Kafka? We built MSK to relieve our customers from the burden of creating, managing, and deploying a Kafka cluster. In particular, if you want to set up a Kafka cluster from scratch, you have to deal with the underlying control plane, which is based on ZooKeeper.

You have to patch the operating systems, and there's a lot of effort in just launching the control plane and managing all the instances. So we took that burden away from you - all this ZooKeeper and control plane management is what we call undifferentiated heavy lifting.

We take that away from you and give you a solution where you just have to focus on how your application is gonna produce messages to MSK and how it's gonna consume messages from MSK.

Now, it's fully compatible with Apache Kafka - you can use the regular SDKs that you're familiar with from Kafka - and it's highly secure. Diego is gonna talk about the security model.

One thing I should say about the scalability of MSK: a single broker node in MSK can host up to 4,000 partitions, and a single broker node can handle up to 1,000 megabytes per second of throughput. That gives you an idea of the scale that a single MSK broker can manage.

So here's the difference between running Kafka on premises versus running it on EC2 versus running it on MSK.

If you run Kafka on premises, then you have to worry about everything from the hardware and the life cycle of that hardware to installing the operating system and patching it and so on.

If you run it on EC2, then at least you don't have to worry about the hardware anymore or the initial operating system install, but you still have to worry about patching and version upgrades and things like that.

But if you run Kafka on MSK, you don't have to worry about any of these low-level housekeeping chores - we manage all of these things for you.

Not only that, we give you high availability out of the box. As soon as you launch your cluster, we distribute those nodes across different Availability Zones, so you're gonna be resilient right from the start. And we also have mechanisms to scale the cluster. We launched MSK three years ago, and our customers asked us: is there some way to make Kafka even simpler? So we offered MSK Serverless earlier this year. With Serverless, there are no servers to manage - we give you a default capacity to start with, and then we grow and shrink it depending on the traffic that's hitting the cluster. Serverless is, again, fully compatible with Apache Kafka, it's got the same security model as MSK, and it's got pay-for-throughput pricing.

All right. So now that I've explained how Kafka works and how it's used in financial services, I'm gonna hand it over to Diego, and he's gonna explain how to use MSK to publish your financial data feed.

OK. Thank you, Rana. We went through financial data feeds, what the use cases are, and what Apache Kafka is. Now we're gonna shift gears a little bit and talk about how you would deploy that financial data feed infrastructure into AWS leveraging MSK. Before we go there, we're gonna do a really quick review of a few of the components of Kafka that are important for how you design that cluster.

When we look at Apache Kafka, the cluster is made of two major components: the brokers and the ZooKeeper nodes. The ZooKeeper nodes are responsible for handling all the management and orchestration of the cluster. The broker nodes are responsible for handling the data. So producers, which are essentially applications or clients sending data to the cluster, send data and the broker receives it. The broker node is responsible for storing that data, replicating it, and guaranteeing the high availability of your data. It's also responsible for the network connectivity - the brokers are the ones that are actually delivering data in and out of the cluster, from producers to consumers. So it's important to understand the responsibilities of those components within the cluster.

When it comes to a cluster, as I mentioned, topics are the container of your data within the cluster. In this example, I have a cluster with six brokers. When I create topic 1, that topic will have a replication factor. The cluster has a default replication factor that you configure; if you want a specific replication factor for a specific topic, you can set it on the command. In this case, I have a replication factor of three, which means your data is gonna land on three different nodes. That is just to assure the high availability: if a broker goes down, you have the data somewhere else. As you keep creating more topics, those topics get distributed across the cluster. That's how you achieve the parallelism and the high availability that Kafka makes available to customers - by spreading those topics across different brokers.

So let's say that you have a specific topic. If you remember, among the brokers that hold replicas of the data, one is called the leader broker. The leader broker is the one responsible for handling all the network connectivity: the leader is the one receiving data from producers and sending data to consumers. It can get to a limit, right? Resources are limited; that server can get to a limit. So if I need to scale a topic even further, there is another lever that you have: partitions. You can create a topic that has multiple partitions, and what happens is the cluster takes those partitions and treats them almost like topics, and they get spread across the cluster. So now for topic 2, I have four partitions, which means I have four brokers handling network connectivity for that specific topic. So if you want to increase throughput for a specific topic - let's say you are producing too much and one broker cannot handle the amount of throughput you are pushing - you can break it into additional partitions and get more network bandwidth out of that specific topic.

OK. Now let's dive deeper into the partitions themselves. Here I have a topic, topic X, with three partitions. When my producer sends data, it sends it to a specific partition. The partition is responsible for storing the data in order: as messages arrive, they get an offset, and that offset keeps adding up. That's how the Kafka cluster maintains the order of the messages that are arriving. If I need to produce more - if a single producer is not able to handle the amount of data that I want to push into the Kafka cluster - I can create additional producers and produce into specific partitions. Now the topic can handle more throughput overall, because I have more producers pushing to those partitions within the topic. Two producers should not write to the same partition, though, because remember, order is kept on a per-partition basis; with two different producers, messages are ordered simply as they arrive, so you might effectively get out-of-order messages. So when producing, try to stick with a specific partition per producer. On the consumer side, it works quite similarly. Consumers fetch the metadata from the Kafka cluster to understand where the partitions and topics are, and then they make a network connection to start consuming. They consume in order, and as they consume messages, the cluster keeps track of the offset they are reading. So if there is a network disruption, once clients connect back, they can say: I wanna start reading from the last message that I read. They don't need to start all over on that specific partition or topic they're reading from.

Let's say the cluster is delivering more data than a specific consumer can keep up with. You can also divide and increase the bandwidth on that side. There is this concept of a consumer group: you can have several consumers within the same consumer group, and they will get spread across different partitions to read data. So you can increase the overall application bandwidth in terms of reading messages from the cluster.

OK. So now I would like to give you a real-world example of how you would break financial data feeds into topics. Here I have a simple example, which is an exchange; the asset class is just securities, stocks. And I have three different feeds: the top-of-book feed, the last trade, and the 15-minute delayed feed. I can essentially create one topic per feed, and customers can subscribe to a specific topic and only consume that specific feed. So let's say I just want the trades: I can subscribe to last trades, and I don't care about the top of book - all the quotes and asks and bids and everything. If I have a specific topic - let's say the top of book, where there is a lot of throughput - I can scale it in different ways. I can scale it by adding more partitions, so I can increase the bandwidth, or I could subdivide that topic: I could say I now have top-of-book trades and top-of-book quotes, so I can divide it even more. So there are a few strategies that you can use to add capacity and bandwidth to your topics.
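To sketch that topic layout, you could create one topic per feed with kafka-python's admin client and give the busier top-of-book feed more partitions. The topic names, partition counts, and broker address below are hypothetical, not values from the talk.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers=["b-1.example-cluster.abc123.c2.kafka.us-east-1.amazonaws.com:9094"],
    security_protocol="SSL",  # certificate settings omitted for brevity
)

# One topic per feed; the high-throughput top-of-book feed gets more
# partitions so more brokers (and more consumers) can share the load.
admin.create_topics([
    NewTopic(name="equities.top-of-book", num_partitions=12, replication_factor=3),
    NewTopic(name="equities.last-trade",  num_partitions=6,  replication_factor=3),
    NewTopic(name="equities.delayed-15m", num_partitions=3,  replication_factor=3),
])
```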

Great. So we went through Kafka really quickly and gave a real-world example. Let's now talk about how you would deploy that cluster within AWS. It always starts with the cluster creation on MSK. We take care of that for you: we spin up the clusters, we configure them, we form the cluster itself, including brokers and ZooKeeper nodes. One thing to keep in mind is that the cluster does not reside in your VPC; it resides in a service VPC. Meaning, if you go to your EC2 console and list all the EC2 instances that are running, you're not gonna see the brokers and the ZooKeeper nodes. So you might ask: when I configure an MSK cluster, I say which VPC and which subnets I want my cluster to be in. What we do there is launch some elastic network interfaces, or ENIs for short, and link them back to the cluster. The end result is the look and feel as if the cluster were inside your VPC, but it's essentially linked back to the service VPC. With that, you get network connectivity controls: you can use network ACLs or security groups, so you can allow or disallow network connectivity to your cluster from whomever you want. We also provide a DNS name for the cluster - each broker gets its own DNS name, so you can connect through that DNS name as well. And as I mentioned before, MSK is fully compatible with all the Kafka tools, so you can connect leveraging the Kafka tools that you already use today.

Good. So deployment of the cluster is really easy: you go to the console or the CLI, whatever tool you choose, and we spin up the cluster for you. Now, how do you let your customers connect to that cluster? The first option that we see a lot of customers using is public connectivity. Let's say you have a customer that is not on AWS and wants to consume that feed over the internet. We launched a feature in November 2021, and we now have the ability to allow public access into the cluster. What that means is, if you put a cluster in a public subnet, we'll give public IP addresses to those brokers, and those brokers can be reached straight from the internet. If you have the internet gateway and the routing tables - all the appropriate network connectivity configuration - they can be accessed straight from the internet. That simplifies the deployment a lot: before that feature, you had to front-end the cluster with a public NLB and get connectivity through that NLB. Now you don't need the NLB anymore. There are a few requirements, though, because of security. One is that the brokers have to be in public subnets - that's kind of obvious. You cannot have unauthenticated traffic: you need to enable authentication and you need to enable encryption, just to keep your cluster secure. Another requirement is that you have to be on Kafka 2.6 or above.
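If you script the public-access switch rather than clicking through the console, a hedged sketch with boto3 might look like the following; the cluster ARN is a placeholder, and the call assumes the cluster already meets the authentication, encryption, and Kafka 2.6+ requirements mentioned above.

```python
import boto3

msk = boto3.client("kafka")

cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/example-cluster/abcd1234"  # placeholder
current = msk.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]["CurrentVersion"]

# Turn on public access; MSK attaches service-provided public endpoints to
# the brokers once the security prerequisites are satisfied.
msk.update_connectivity(
    ClusterArn=cluster_arn,
    CurrentVersion=current,
    ConnectivityInfo={"PublicAccess": {"Type": "SERVICE_PROVIDED_EIPS"}},
)
```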

OK. So we mentioned the public option. How about the private one - the use case where my customer is already on AWS? Your customer is there, and they want to consume that data feed through AWS without going over the internet. With that, I already gave a hint: we're gonna use PrivateLink to achieve that. How do you deploy that? I have the scenario here: I have my own VPC, which is a different VPC from my customer's VPC - it could even be in a different account. The first thing I need to do is deploy a load balancer. PrivateLink requires an NLB in front of the service that you wanna expose, so you need to deploy the NLB. After you deploy the NLB, you can enable the PrivateLink endpoint service, and that creates an endpoint service that customers can find and use to trigger the private connectivity into your account. They will deploy a VPC endpoint; that VPC endpoint connects back to your VPC endpoint service - the PrivateLink - and you establish trust between those two VPCs, two accounts. The benefit here is that you don't need to worry about any of the network connectivity in terms of overlapping IP addresses or routing - none of that stuff; we take care of it for you through PrivateLink. As Rana mentioned, it's ideal for SaaS applications where you don't really wanna care about what's in your client's infrastructure; you just wanna expose the service to your clients. So PrivateLink makes that the perfect scenario.
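As a rough sketch of exposing that NLB as a PrivateLink endpoint service with boto3 - the load balancer ARN and the customer account are placeholders - it might look like this:

```python
import boto3

ec2 = boto3.client("ec2")

# Expose the NLB that fronts the MSK brokers as a PrivateLink endpoint service.
service = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/msk-nlb/abc123"
    ],
    AcceptanceRequired=True,  # you approve each customer connection explicitly
)
print(service["ServiceConfiguration"]["ServiceName"])

# Optionally allow a specific customer account to discover and connect.
ec2.modify_vpc_endpoint_service_permissions(
    ServiceId=service["ServiceConfiguration"]["ServiceId"],
    AddAllowedPrincipals=["arn:aws:iam::210987654321:root"],  # customer account (placeholder)
)
```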

So now I have the piping between my VPC and my client's VPC, and I can establish network connectivity. I reach out to my bootstrap servers - that's what you provide to your clients. They will have the b-1-style broker names, with the Kafka cluster name and all of that in the domain name. However, that is a private DNS name, and your customer is in a different VPC - how do they resolve that domain name? We recommend deploying Route 53 private hosted zones, and with those hosted zones, you can alias the broker names to the VPC endpoints. So now you can resolve the brokers, and now you can truly connect to your brokers.
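On the consumer side, a sketch of that Route 53 setup with boto3 might look like the following; the zone name, VPC ID, broker name, and endpoint DNS name are all placeholders chosen for illustration.

```python
import boto3

r53 = boto3.client("route53")

# Private hosted zone covering the broker domain, attached to the
# customer's VPC (all IDs and names are placeholders).
zone = r53.create_hosted_zone(
    Name="abc123.c2.kafka.us-east-1.amazonaws.com",
    CallerReference="msk-privatelink-demo-001",
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123456789abcdef0"},
    HostedZoneConfig={"PrivateZone": True},
)

# Point each broker's name at the interface VPC endpoint's DNS name, so the
# Kafka client resolves b-1/b-2/b-3 to the PrivateLink endpoint.
r53.change_resource_record_sets(
    HostedZoneId=zone["HostedZone"]["Id"],
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "b-1.example-cluster.abc123.c2.kafka.us-east-1.amazonaws.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "vpce-0abc123-xyz.vpce-svc-0123.us-east-1.vpce.amazonaws.com"}
                ],
            },
        }]
    },
)
```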

So the way it happens: you reach out to the bootstrap servers - in this case on port 9094, the TLS port that Kafka, or I should say MSK, uses - and I connect to my brokers and retrieve the metadata on which topics exist and where they are placed. With that, my client application can pick a topic to consume. Let's say in this example I pick topic 3, which resides on broker 3. If the client connects back through that same port number, what's gonna happen is, because the NLB is configured to do round robin, I could land on broker 2, but I'm trying to connect to broker 3, right? So Kafka will throw me an error that I'm trying to read from a non-leader partition. To solve that problem, there is a functionality within Kafka called advertised listeners: I can customize the ports that I tell customers to connect to. So if a client wants to connect to me as broker 1: please connect to port 8441, in this case. Now when they retrieve the metadata, the port they connect to for broker 1 is 8441. I then deploy an NLB listener on that same port that targets my broker 1 only, and I repeat that for all the brokers within my cluster. Now when the client application needs to connect to broker 3, it will connect to broker 3 at port 8443, which means I'm gonna hit the listener on 8443.

That is only gonna take me to broker 3. So now I have specific connectivity into that broker, and I've solved the problem of hitting a broker that does not contain the partition or topic that I wanna read from. Cool.
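One way to push those per-broker advertised listeners is through Kafka's dynamic broker configuration. MSK guidance typically shows this with the kafka-configs.sh CLI; the sketch below does the equivalent AlterConfigs call from kafka-python, and the listener name (CLIENT_SECURE), broker DNS name, and ports are assumptions for illustration, not values from the talk.

```python
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(
    bootstrap_servers=["b-1.example-cluster.abc123.c2.kafka.us-east-1.amazonaws.com:9094"],
    security_protocol="SSL",  # certificate settings omitted for brevity
)

# Per-broker dynamic config: tell clients that broker 1 is reachable on the
# NLB port dedicated to it (8441). Repeat for each broker with its own
# port (8442, 8443, ...), matching the NLB listeners you created.
admin.alter_configs([
    ConfigResource(
        ConfigResourceType.BROKER,
        "1",
        configs={
            "advertised.listeners":
                "CLIENT_SECURE://b-1.example-cluster.abc123.c2.kafka.us-east-1.amazonaws.com:8441"
        },
    )
])
```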

So we need to understand the pros and cons of that architecture. The first thing is that it's quite a flexible architecture: you can deploy any sort of authentication method through it - you can even do unauthenticated, even though we don't recommend that. The cons are that you need to maintain all the listeners on the NLBs, and you need to maintain all the advertised listeners for the brokers. Those would be the costs of this type of architecture.

However, it does work and it does scale really well - you can increase the number of brokers without a problem, just by adding another listener. So this is how the end-to-end architecture would look: I deploy a cluster, the cluster is in a public subnet, I have an internet gateway, and I have connectivity to that cluster through the internet. I have an NLB and PrivateLink, which gives me the private connectivity, so I can get to that same cluster. So I address both use cases: customers that want to consume from the internet and customers that want to consume through PrivateLink. I can also enable Direct Connect and connect straight into my VPC, so my producer can send data from wherever it is into that cluster. The producer gets connectivity, and consumers get connectivity from both the internet and PrivateLink.

Great. So we discussed the architecture and how a client would connect. Now I wanna give you a quick example of a Kafka client leveraging Python: what would it take to write simple Python code to consume from and produce into your Kafka cluster?

The first thing you need to do is import the Kafka library. There is a class called KafkaConsumer that instantiates a consumer client. I import the list of my bootstrap servers from environment variables, then I instantiate that client. On that client, I need to put some configuration: the topic that I'm gonna consume from, the group ID - remember the consumer group ID that we discussed, I need to define that here - and I also put some of the TLS configuration there. Then I can just loop through the messages. Here I'm printing the metadata from each message - the topic, the partition, the offset - and then I can also print the message itself if I want to.
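A consumer along the lines of what the slide describes might look like this with kafka-python; the topic, group ID, and certificate paths are placeholders.

```python
import os
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "equities.last-trade",                               # topic to consume (placeholder)
    bootstrap_servers=os.environ["BOOTSTRAP_SERVERS"].split(","),
    group_id="customer-1-trades",                        # consumer group ID
    security_protocol="SSL",                             # TLS configuration
    ssl_cafile="ca.pem",
    ssl_certfile="client-cert.pem",
    ssl_keyfile="client-key.pem",
    auto_offset_reset="latest",
)

for message in consumer:
    # Print the metadata (topic, partition, offset), then the payload itself.
    print(message.topic, message.partition, message.offset)
    print(message.value)
```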

For a producer client, it's done the same way. Instead of importing the consumer class, I import the producer class. I get the bootstrap servers from my environment variables, then I instantiate a client and pass a few configuration parameters as well: the brokers, the TLS configuration, and the serialization method I use - so if I wanna use, for example, plain-text JSON, or any sort of binary format like Protobuf, I can configure that there. Then I can define a message and push that message.
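And a matching producer sketch, again with placeholder names and a plain JSON serializer that you could swap for Protobuf or another binary format:

```python
import json
import os
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ["BOOTSTRAP_SERVERS"].split(","),
    security_protocol="SSL",                              # TLS configuration
    ssl_cafile="ca.pem",
    ssl_certfile="client-cert.pem",
    ssl_keyfile="client-key.pem",
    # Serialization method: plain JSON here; swap the serializer to use
    # Protobuf or another binary format.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("equities.last-trade", {"ticker": "AMZN", "price": 145.32, "size": 100})
producer.flush()
```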

One thing I forgot to mention on my previous slides: at the top of the slides there are some QR codes. You don't need to worry about capturing them - you're gonna get the presentation later on, so you can get the QR codes then. That QR code takes you to a GitHub repository on aws-samples, where you're gonna find the CDK code to deploy this infrastructure for you. So you don't need to do it manually - you can just deploy the CDK code. You're gonna find the Kafka client samples there, and you're gonna find some of the best practices in terms of monitoring that we're gonna discuss later on.

OK. So before we go to monitoring, let's talk about security. We already talked a little bit about network security through the network ACLs and security groups. When it comes to encryption, we offer two methods. There's encryption at rest with KMS, so you can encrypt the EBS volumes leveraging KMS keys. For encryption in transit, we enable TLS: we can enable TLS for inter-broker and intra-cluster communication, we can enable TLS for client-to-broker communication, and then there are the service APIs, which also leverage TLS. So we can enable encryption in transit all the way through. When it comes to authentication, our talk is gonna focus on digital certificates - I'm gonna go deeper on the next slide - and then we're gonna talk about how you authorize through Kafka ACLs as well. So those are the four levers that you have when it comes to securing your MSK cluster.

So let's go into authentication. Here I have a list of open source Kafka and MSK and the authentication methods that they support. It's pretty much the same, with the exception that MSK does not support Kerberos. In terms of authorization, it's done the same way, through Kafka ACLs.

So let's talk about the certificates - how you would enable digital certificates and all the PKI within the MSK cluster. We start from the cluster itself and a client. On the cluster, you need to enable TLS authentication. The first step is to create a private CA, so I have a root PKI there that I link to my MSK cluster. What happens is, when I enable TLS, the broker gets its digital certificates from a public PKI owned by Amazon and imports the root certificate from my private certificate authority.

Now I create certificates for clients: they create a certificate signing request and send it to that private CA, the private CA issues their certificate, and now the client can authenticate against the cluster. But when the cluster presents its TLS information, its certificate was signed by a different PKI, so I need to import the root certificate from that public Amazon CA into the client as well, so that mutual authentication and trust can happen. Once that is done, authentication occurs, and I can use the DN of the digital certificate to do authorization with Kafka ACLs.
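If your private CA is AWS Private CA (ACM PCA), a hedged sketch of signing a client's CSR with boto3 could look like this; the CA ARN and file names are placeholders, and the client is assumed to generate its own key pair and CSR (for example, with openssl) and send you only the CSR.

```python
import boto3

pca = boto3.client("acm-pca")

ca_arn = "arn:aws:acm-pca:us-east-1:123456789012:certificate-authority/abcd-1234"  # placeholder

# The CSR the client sent you, read from a file for illustration.
with open("client.csr", "rb") as f:
    csr = f.read()

issued = pca.issue_certificate(
    CertificateAuthorityArn=ca_arn,
    Csr=csr,
    SigningAlgorithm="SHA256WITHRSA",
    Validity={"Value": 365, "Type": "DAYS"},
)

# Wait for issuance, then hand the signed certificate back to the client.
pca.get_waiter("certificate_issued").wait(
    CertificateAuthorityArn=ca_arn, CertificateArn=issued["CertificateArn"]
)
cert = pca.get_certificate(
    CertificateAuthorityArn=ca_arn, CertificateArn=issued["CertificateArn"]
)
print(cert["Certificate"])
```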

So how does authorization happen? By default, MSK does not have ACLs on topics, and the behavior is: if there are no ACLs, all operations - reads and writes - are allowed on my topic. When you enable public access to your brokers, we change that behavior, and now if no ACLs are found, nobody can do any operation on the topic.

So for this architecture, we're always gonna need to configure ACLs, because ACLs are enforced by default, which changes the usual default behavior of MSK. Here I want producer 1 to write into both topics and consumer 1 to read from both topics. The first thing I need to do is create an ACL entry. An ACL works like this: principals are allowed or denied an operation on a specific resource. In this case, producer 1 is allowed to write into topic 1, which enables producer 1 to produce into topic 1. I want producer 1 to also produce into topic 2, so I need to add another entry for topic 2 so the producer can write. The same goes for the consumers: I need to add the consumer ACL entries so consumer 1 can read from both topics.
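A sketch of creating those ACL entries programmatically with kafka-python's admin client follows; the principal DNs, topic names, and broker address are hypothetical. Note that consumers typically also need READ on their consumer group, which would be an additional entry with ResourceType.GROUP.

```python
from kafka.admin import (
    KafkaAdminClient, ACL, ACLOperation, ACLPermissionType,
    ResourcePattern, ResourceType,
)

admin = KafkaAdminClient(
    bootstrap_servers=["b-1.example-cluster.abc123.c2.kafka.us-east-1.amazonaws.com:9094"],
    security_protocol="SSL",
    ssl_cafile="ca.pem", ssl_certfile="admin-cert.pem", ssl_keyfile="admin-key.pem",
)

def allow(principal, operation, topic):
    # The principal is the DN from the client's TLS certificate.
    return ACL(
        principal=principal,
        host="*",
        operation=operation,
        permission_type=ACLPermissionType.ALLOW,
        resource_pattern=ResourcePattern(ResourceType.TOPIC, topic),
    )

admin.create_acls([
    allow("User:CN=producer-1", ACLOperation.WRITE, "topic-1"),
    allow("User:CN=producer-1", ACLOperation.WRITE, "topic-2"),
    allow("User:CN=consumer-1", ACLOperation.READ,  "topic-1"),
    allow("User:CN=consumer-1", ACLOperation.READ,  "topic-2"),
])
```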

OK, good. We discussed how you deploy the MSK cluster, the architecture, how you would connect with clients using Python, and the security - all the authorization and authentication components. Your clients are now consuming your financial data stream on AWS, and you're ready to start monitoring your service, because you need to operate that service, right?

So when it comes to monitoring, there are two things available to you with Amazon MSK. The first one is the Kafka metrics. We allow you to export those Kafka metrics in two ways. The first is through CloudWatch, so you can get all those metrics in CloudWatch. The second, if you want some sort of open source Grafana setup to look into these metrics, is that we can export them to Prometheus: we install an open source Prometheus agent on the cluster for you, and we can export to your Prometheus environment. You have four levels of metrics to choose from, and they are additive. The basic one just gives you cluster-wide metrics. The second level is the broker level, so I get cluster and broker information. The third is the topic level, and the fourth is the partition level. So I can get the full depth of all the metrics available in Kafka, exported either to CloudWatch or to an open source Prometheus stack.
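As a hedged sketch, here's how turning on the deepest metric level plus the Prometheus exporters might look with boto3; the cluster ARN is a placeholder.

```python
import boto3

msk = boto3.client("kafka")

cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/example-cluster/abcd1234"  # placeholder
current = msk.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]["CurrentVersion"]

msk.update_monitoring(
    ClusterArn=cluster_arn,
    CurrentVersion=current,
    # Deepest of the four additive levels: cluster, broker, topic, partition.
    EnhancedMonitoring="PER_TOPIC_PER_PARTITION",
    # Expose the Prometheus JMX and node exporters for an open-source
    # Prometheus/Grafana stack.
    OpenMonitoring={
        "Prometheus": {
            "JmxExporter": {"EnabledInBroker": True},
            "NodeExporter": {"EnabledInBroker": True},
        }
    },
)
```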

I also have logs. All the Kafka cluster logs can be exported to CloudWatch Logs, Amazon S3, and Amazon Kinesis Data Firehose. So now you can get all those metrics and all that logging within AWS.

OK. So you are now operating your cluster and monitoring it. It's now time for you to actually make money out of your service. For that, I will hand it back over to Rana, and he will walk you through how you would integrate AWS Marketplace with your cluster to start monetizing your data feeds.

Thanks, Diego. All right. So Diego has just gone over how you'd use MSK to publish your data feed. Now I'll go over what you need to do to monetize your data feed - how are you gonna make money from this? The best way is to integrate your application with AWS Marketplace. On AWS Marketplace, we currently have more than 325,000 active subscribers, and more than 2 million subscriptions are on there. So essentially, if you publish your application on Marketplace, you're gaining exposure to a vast array of companies who would potentially be interested in purchasing your product.

Now, one of the nice things about Marketplace is that it gives you a couple of things. It gives you a quick start, which is an application that you customize a little bit to handle registrations from your clients. And it gives you a bunch of APIs. The APIs can send data to Marketplace, like usage data and entitlements data, so that Marketplace can help you enforce the entitlements that your users are entitled to after they enroll in your product.

So here's an example. The quick start rolls out this application where your clients are going to fill out this form. After they fill out the form, your job is to handle the registration and whatever it takes to enroll your customer into your service. And I'll go over how that is specifically done for your MSK clients here.

The first step is you just deploy the quick start, and that actually deploys a serverless application. It kind of scaffolds out what you need to build: it's got that registration form in the front, and essentially it's based on API Gateway, Lambda, and DynamoDB. Your job is to fill out that Lambda function. Whenever somebody fills in that registration form, the data gets presented to API Gateway, and then your Lambda is basically going to say: OK, here's a new client - what should I do?

The first thing you need to do, because this client of yours is gonna consume your data feed, is give them a certificate, just like Diego mentioned - it's mutual TLS. So you use your private certificate authority to generate a certificate for your client. And remember, what Diego also mentioned is that you need to enable ACLs - without an ACL, your clients are not going to be able to consume any topic whatsoever.

So the two things you need to do when you enroll that client are: generate that TLS certificate for them, and also update your ACLs so that when they present that certificate to you, you're going to allow them access to the topics that they're entitled to.

All right, another step is that you can store whatever certificate you've generated in Secrets Manager, and you can store their enrollment information - their company name and all this other information about them - in DynamoDB. DynamoDB is a nice way to persist data on a per-client basis. And then, after you've gone through these steps, because this is a web app that's scaffolded for you, you can lead them to a page where they can download a quick start guide.
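Pulling those enrollment steps together, a hypothetical Lambda handler behind the quick start's API Gateway might look roughly like this; the table name, field names, and secret naming scheme are all assumptions, and the certificate issuance itself is the ACM Private CA step sketched earlier.

```python
import json
import boto3

secrets = boto3.client("secretsmanager")
table = boto3.resource("dynamodb").Table("FeedSubscribers")  # hypothetical table name

def handler(event, context):
    """Hypothetical enrollment handler invoked by API Gateway."""
    registration = json.loads(event["body"])
    customer_id = registration["customerId"]

    # Persist the signed certificate issued for this client (see the private
    # CA step above) so you can rotate or revoke it later.
    secrets.create_secret(
        Name=f"feed/clients/{customer_id}/certificate",
        SecretString=registration["signedCertificatePem"],
    )

    # Keep the enrollment record (company name, entitled topics, ...) per client.
    table.put_item(Item={
        "customerId": customer_id,
        "companyName": registration["companyName"],
        "entitledTopics": registration.get("entitledTopics", ["equities.delayed-15m"]),
    })

    return {"statusCode": 200, "body": json.dumps({"status": "enrolled"})}
```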

Inside the quick start guide, you can give them instructions like: here's how you download and install your certificate, here's a sample application showing how you could consume our data feed, and here's the connection string - the information you need to connect to the cluster.

So Marketplace - and this particular quick start app - is a good way to get your clients up and running quickly to consume your data feed service.

All right. So we have learned about the different kinds of financial data streams. We've learned quite a bit about Kafka and its use in the financial services industry. And we've specifically learned how you can use MSK to publish your data feed on AWS and monetize it using the Marketplace.

Some logical next steps, then, if you'd like to find out more about all this: there are some QR codes here that you can explore to gain a little bit more background information.

The first couple are MSK best practices and best practices for right-sizing. The third one points to a GitHub repo where, if you launch the solution in there, you will be able to automatically deploy a cluster that has a lot of the components we've talked about. It supports both public and private access, and it has a sample app to publish and consume market data. So it's kind of like a starter environment for you to play with.

And as Diego mentioned, don't worry if you are not able to capture these QR codes right here, because this presentation is gonna be made public after the conference.

All right. So before you leave, please make sure to fill out the survey - we really appreciate your feedback; just go into your app to fill it out. I really wanna thank you for your presence here and for your time. If you have questions for us, we'll be at the back of the room to entertain any questions that you might have. So thank you once again for being here.
