Scaling data processing with Amazon EMR at the speed of market volatility


Good evening, everyone. Thanks for joining us. My name is Meenakshi Shankaran. I'm a senior big data architect with AWS, and I have been working with FINRA for the past three years. I have Sat Kumar Velusamy, Director of Technology at FINRA, with me, and we are here to speak about scaling EMR at the speed of market volatility.

And before we get started, I have two questions:

  • How many of you have worked with clusters of 400 or 600 nodes?
  • How many of you have used Apache Spark for batch processing?

That's great. So we all have processing deadlines. We have use cases that need to be completed in a specified amount of time, and we have to deal with the business process in those allotted time frames. And when a processing deadline is combined with unpredictable and huge volumes of data, it becomes a challenge.

And today we are going to speak about:

  • Who's FINRA
  • What is Consolidated Audit Trail
  • How did FINRA build an application that processes billions of events per day at scale
  • What lessons did we learn while building this application and optimizing it to scale with the growth of the market

With that, I'll hand it over to Sat, who will talk about the application, its architecture, and the lessons learned from it.

Thank you, Meenakshi. Hello everyone. My name is Sat Kumar Velusamy. I'm a Director at FINRA managing the CAT linker and ETL projects. I'm very excited to be here to talk about how FINRA processes billions of market events every day and delivers feedback to the data reporters within four hours.

Also, I want to talk about the challenges that we had in implementing such a complex system and how we solved those challenges in the last two years. So we have a lot of content to cover today. I'm going to get started with some introduction.

So who is FINRA? FINRA stands for Financial Industry Regulatory Authority. It's a regulatory organization authorized by the government to regulate broker-dealer firms and brokers. Our mission is investor protection and market integrity: we ensure that every investor can participate in the market fairly and honestly.

We regulate 3,400 broker-dealer firms and over 620,000 individual brokers. We work with 32 exchanges to get the trading activity from all of them. We process more than 600 billion records every day, we collect 130,000 files from broker-dealer firms and exchanges, and our storage footprint is one of the largest in the financial services industry.

We have about 500-plus petabytes in our data lake. We create more than 9,000 clusters of varying sizes every day, with a total instance capacity of over 300,000 instances, and we run thousands of data ingestion jobs and hundreds of ETL jobs on those clusters.

What is Consolidated Audit Trail?

In response to the May 2010 flash crash, the Securities and Exchange Commission (SEC) adopted a rule requiring all the exchanges that operate in the US, and FINRA, to create a plan to build a consolidated audit trail system. The consolidated audit trail helps regulators easily track market activity and use the data to analyze and find market manipulation or any other activity happening in the marketplace.

FINRA was awarded this contract in 2019 and we built this system in two years. Most of the functionality was delivered in those two years, and we are now working on the last phase of the project.

So why is CAT so complex?

We get billions of records every day from broker-dealer firms and exchanges, and we have to process the data and deliver the feedback within four hours. If you look at the timeline, both firms and exchanges have to submit all of today's trading activity by 8am tomorrow.

Once we get the data, we have four hours to process it and deliver the results back to the broker-dealer firms and exchanges. It's a very tight SLA: billions of records processed, feedback generated, and delivered to all the firms in four hours.

So how do we do this?

Here is a high level process flow:

Broker-dealer firms and exchanges submit their data to the CAT system. Every file we ingest into CAT is validated in two stages. First we validate the file itself: we make sure the files are named per the requirements, the file sizes are within the allowed limit, and the file compression is as called for in the specification.

Once file validation is complete, we open the file and perform record-level validation. Every record is validated for data type, data size, allowed values, and some conditional rules.

Good records are accepted for further processing. Bad records are written back to the reporters.
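
To make the record-level validation concrete, here is a minimal PySpark sketch that splits accepted and rejected records. The column names, rules, and S3 paths are illustrative placeholders, not FINRA's actual CAT specification.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("record-validation-sketch").getOrCreate()

# Hypothetical submitted events; schema and rules are placeholders.
events = spark.read.parquet("s3://example-bucket/cat/submissions/")

allowed_event_types = ["NEW_ORDER", "ROUTE", "EXECUTION"]  # placeholder allowed values

validated = events.withColumn(
    "validation_error",
    F.when(F.col("event_type").isNull(), "missing event_type")
     .when(~F.col("event_type").isin(allowed_event_types), "event_type not in allowed values")
     .when(F.length("firm_id") > 20, "firm_id exceeds allowed size")
     .otherwise(F.lit(None)),
)

# Good records continue to linkage validation; bad records go back to the reporter.
good = validated.filter(F.col("validation_error").isNull()).drop("validation_error")
bad = validated.filter(F.col("validation_error").isNotNull())

good.write.mode("overwrite").parquet("s3://example-bucket/cat/accepted/")
bad.write.mode("overwrite").parquet("s3://example-bucket/cat/feedback/rejected/")
```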

Good records go to the next stage of processing, which is linkage validation. This is a linkage discovery phase: for every secondary event, we find its parent event. It is the critical job in our CAT processing.

We have only four hours to process and deliver the feedback to all the reporters. After linkage validation is complete, both feedbacks - the linkage validation feedback and the ingestion feedback - are reported to firms and exchanges.

The feedback files for firms go back over the same SFTP route, and the feedback for exchanges is delivered through S3 buckets.

Once this linkage validation process is complete - that's the 8am-to-noon process, so we have to deliver the feedback by noon - we wait three days for corrections to come in, and then we start assembling the lifecycles and the enrichments.

All of this runs on massive EMR clusters, 500 or 600 node clusters, and all the processed data is stored on S3 and registered in our data catalog, where it's available to regulatory users.

So what are the challenges in building such a complex system?

CAT has challenges in every dimension. Performance is the main challenge because we have to complete the process in four hours, and many things can impact performance: the data volume, a sudden spike in volume, and the complexity of the application itself. There are hundreds of linkage rules, and the process has to go through all of them.

Data skew can hurt performance, and at this volume you also hit random Spark issues that can hurt performance.

Resiliency - CAT processes billions of records in four hours; we don't have time to lose even five minutes. A job has to recover immediately and resume processing from the step where it failed.

And scalability - market volume fluctuates with the market. If there is a 5% drop in the S&P or the Dow Jones, we'll see a 40 to 50% increase in our market volumes, and year over year we see 20 to 30% volume growth. So scalability is a challenge: we have to scale to any given volume and still process it within four hours.

Cost - as volume goes up, cost goes up too. A 30% increase in volume means 30% more compute capacity, and there is a storage cost associated with the increased volume as well. But our goal is to keep cost the same or lower by using better hardware and optimizing our application.

All of these non-functional parameters - resiliency, scalability, and application performance - are critical to achieving the SLA.

Adding more capacity to a job is not going to improve performance if the application can't scale linearly. So it's a balancing act: application performance, resiliency, and scalability all have to be handled efficiently.

Over the last two years we did a lot of optimization to meet this SLA. In the next few slides I'm going to share some of the key improvements we made, which raised the overall performance of the system and let us meet the SLA consistently at the new peak volumes.

When we started this project two years ago, we went with a non-progressive architecture. What that means is that the linkage validation run between 8am and noon processed all the data: we waited for all the files to come in, started the process at eight o'clock, and tried to complete it by noon.

Initially, when volumes were low, that was fine - we were able to process and deliver the feedback. When the volume increased, we could no longer meet the SLA, because everything was processed after 8am; there was no pre-processing happening.

So how did we solve this?

We went with a progressive architecture where we process up to 80% of the data before 8am. Certain linkages can be done before 8am: if a firm has completely finished its submissions to CAT, we can do the intra-firm linkage - the linkage that happens within the firm - and set that data aside.

So we started processing in two batches. We looked at the file arrival pattern over the last six months to a year and came up with a schedule. At midnight we run the first batch, which processes about 50% of the volume - meaning roughly 50% of firms have submitted their data to CAT by midnight.

Then we run another batch at 4am where we process about 30% of the submissions. So overall, 60 to 80% of the submissions are processed before 8 o'clock. When 8 o'clock comes, we have only 20% left to process.

That remaining 20% goes through intra-firm linkage, then we perform the inter-venue linkage and publish the feedback. This change in architecture and schedule saved us almost two hours in our 8am-to-noon processing, so it was a big win for us.

The key takeaway from this: look at your architecture. If you have a massive data processing application with a tight SLA, see what you can process ahead of time, before the core hours, and minimize the amount of data you have to process during the core hours.

Look at the business process and your processing pipeline and see what optimizations you can gain - what can be done before the core hours.

The next improvement is the Spark version upgrade. When we started two years ago, we were using EMR 5.27 and Spark 2.4. At the time, for those volumes, that version was fine, and we didn't have major issues during the initial days.

After a few months, we started seeing problems because of the volume. There are open source bugs in Spark 2.4 - we were getting random log4j errors - and Spark speculation had an issue, so we couldn't use speculation for better resiliency and performance.

So we had to upgrade. We looked at the versions available at the time: EMR 6.5 was the latest, and Spark 3.1 came with it. Simply by upgrading to that version and using the new NVMe solid-state-disk-based instance types, we got a 30% performance improvement.

The new version also gave us new capabilities: new, cheaper instance types became available to us. That's another improvement we made early this year.
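
As a rough illustration of what that upgrade looks like when launching a cluster, here is a hedged boto3 sketch with the newer release label and an NVMe-backed Graviton instance type. The role names, subnet, sizing, and instance type are placeholders, not FINRA's actual configuration.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster request: the point is the newer release label (Spark 3.1)
# and an NVMe-backed Graviton instance type, not the exact roles or sizing.
response = emr.run_job_flow(
    Name="linkage-validation-sketch",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",                 # placeholder role names
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "r6gd.4xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "r6gd.4xlarge", "InstanceCount": 50},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder
    },
)
print(response["JobFlowId"])
```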

The takeaway here is to have a regular cadence for upgrading your software - say every six months or a year - to take advantage of the performance improvements and bug fixes that come with new versions.

The next one is a CAT application challenge related to Spark shuffle. Those who work with Spark know that shuffle is a killer - it's bad for performance because it is I/O heavy: the data has to be transferred from one node to another.

This linkage process has multiple group-by operations, and every group-by operation triggers a shuffle.
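
As a small illustration of that point (with made-up data and column names), each groupBy shows up as an Exchange - a shuffle - in the Spark physical plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-illustration").getOrCreate()

# Toy stand-in for linkage events keyed by a linkage key.
events = spark.createDataFrame(
    [("ORD-1", "firm_a", 100), ("ORD-1", "firm_b", 250), ("ORD-2", "firm_a", 75)],
    ["linkage_key", "firm", "quantity"],
)

# Each groupBy repartitions the data by key across the cluster.
grouped = events.groupBy("linkage_key").agg(
    F.collect_list("firm").alias("firms"),
    F.sum("quantity").alias("total_quantity"),
)

# The physical plan contains an Exchange (shuffle) feeding the aggregation;
# at CAT scale that exchange can move on the order of 100 TB of data.
grouped.explain()
```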

Some linkage keys have over 100 terabytes of data that we have to shuffle. We were using EBS-backed instances at the time, and the IOPS on the EBS volumes was saturated - shuffling 100 terabytes of data between nodes is a heavy I/O operation.

So how did we fix that? We looked at NVMe-based instances. NVMe is good for both random and sequential I/O access, and it provides much better IOPS for shuffle-heavy work.

We moved to Graviton instance types. We benchmarked three or four instance types with NVMe-based disks and picked Graviton because it gives the best of both worlds: it has NVMe disks and it's also cheaper. After moving to Graviton we saw a 30% performance improvement and roughly 50% cost savings.

The next challenge is resiliency. Before we talk about the problem, I want to quickly describe our orchestrator. We built an in-house orchestrator that has all the features of the open source orchestrators. We use a Lambda function to trigger jobs: it takes the job parameters and passes them to a standard Step Functions template, which launches the EMR cluster and adds the step - a Spark application - to the cluster.
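
The orchestrator itself wasn't shown in the talk, so this is only a hedged sketch of the pattern described: a Lambda handler that forwards job parameters to a standard Step Functions state machine, which in turn launches the EMR cluster and adds the Spark step. The parameter names and environment variable are hypothetical.

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """Forward job parameters to a standard Step Functions state machine that
    launches the EMR cluster and adds the Spark step. Parameter names and the
    STATE_MACHINE_ARN environment variable are hypothetical."""
    job_params = {
        "jobName": event["jobName"],
        "sparkArgs": event.get("sparkArgs", []),
        "clusterSize": event.get("clusterSize", 400),
    }
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        name=f"{event['jobName']}-{context.aws_request_id}",
        input=json.dumps(job_params),
    )
    return {"executionArn": response["executionArn"]}
```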

The way we designed the application, it is a single Spark application running on the EMR cluster for one, two, or three hours depending on the volume. If there is a failure at the end of the second or third hour, the entire job fails, and restarting means restarting from the beginning. That won't work for us: with only a four-hour SLA, we cannot lose two hours of work we have already completed.

So how did we solve this? Resiliency through checkpointing and restart - we checkpoint at regular intervals. We broke the application into small steps, and after every logical step we checkpoint the data. There are two options: checkpoint on the EMR cluster or checkpoint on S3.

There is a cost associated with checkpointing: at this scale, the write takes three or four minutes. But that's worth spending - we don't want to lose two or three hours of processing, so we can afford three or four minutes of checkpoint time.

So after every stage we checkpoint - on the cluster's HDFS in some cases, on S3 in others, depending on the requirement. If you want to start the next step on a new cluster, checkpoint to S3. And if you have dependent jobs expecting data from this job, checkpoint to S3 so that three or four jobs can use the same data.

And we track the status of the steps in a DynamoDB table, which keeps track of which steps completed, which failed, and so on.

To summarize: checkpointing at regular intervals helps us recover from failures. We checkpoint after every logical step - for our use case each step is no more than 30 minutes; that's how we decided to break it up. We maintain the state in the DynamoDB table and restart automatically based on the failure. Having automatic restart has really helped: in our experience, more than 50% of first-time failures completed successfully after the restart, so we don't have to manually intervene and see what failed - the job just restarts automatically.
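
Here is a hedged sketch of that checkpoint-and-restart pattern: each logical step writes its output to S3 and records completion in DynamoDB, so a restart skips completed steps and resumes from the failed one. The table name, key schema, bucket, and step functions are assumptions for illustration, not FINRA's implementation.

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-restart-sketch").getOrCreate()
table = boto3.resource("dynamodb").Table("job-step-status")   # hypothetical table name

JOB_ID = "linkage-2024-01-15"                                 # hypothetical run identifier
CHECKPOINT_ROOT = "s3://example-bucket/checkpoints"           # placeholder bucket

def step_done(step_name):
    """Return True if this step was already completed in a previous run."""
    item = table.get_item(Key={"job_id": JOB_ID, "step": step_name}).get("Item")
    return item is not None and item.get("status") == "COMPLETE"

def run_step(step_name, build_df):
    """Skip steps that already completed; otherwise compute the step, checkpoint
    the result to S3, and record completion so a restart resumes from the
    failed step instead of from the beginning."""
    path = f"{CHECKPOINT_ROOT}/{JOB_ID}/{step_name}"
    if step_done(step_name):
        return spark.read.parquet(path)
    df = build_df()
    df.write.mode("overwrite").parquet(path)   # the checkpoint write (a few minutes at scale)
    table.put_item(Item={"job_id": JOB_ID, "step": step_name, "status": "COMPLETE"})
    return spark.read.parquet(path)

# Example pipeline of logical steps (inputs and transformations are placeholders).
validated = run_step("validate", lambda: spark.read.parquet("s3://example-bucket/cat/accepted/"))
linked = run_step("intra_linkage", lambda: validated.groupBy("linkage_key").count())
```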

But we cannot stop at optimizing just the application - we have to optimize the infrastructure too. What good is a scalable application if the infrastructure is not available or not optimized? With that, I'll hand it over to Meenakshi, who will talk about how the infrastructure was optimized for scalability and resiliency.

Thanks, Sat, for sharing how the application architecture was optimized for better performance and resiliency. For any application with tight processing deadlines, in addition to application performance and resiliency, your infrastructure - and the AWS services you use - has to deliver performance and resiliency as well.

During the initial days of processing, we started encountering a resiliency problem: underperforming instances. FINRA on average creates about 300,000 instances per day, which means that at any point in time we will land on just about every server in an availability zone. With that probability, there is always a chance of landing on a physical server that has a known or unknown issue manifesting. This is hardware - you will have known problems or unknown issues showing up on those servers.

Every time we encountered an underperforming instance - and by underperforming I mean an instance that does not finish its processing in the expected amount of time - we were impacted by 15 to 20 minutes. Fifteen to 20 minutes per instance is very critical in a four-hour window: that's 10 to 20% of the processing time wasted on retries and straggler tasks.

When we encountered these underperforming instances, there were three symptoms. The first is fetch failures. As with any shuffle, the mapper writes temporary output files that a reducer has to fetch to complete the shuffle, and we started seeing a lot of fetch failures on the underperforming instance. That was the first impacting behavior we observed.

The second behavior was inconsistent task performance within a stage. On average, a stage for us has anywhere between 10,000 and hundreds of thousands of tasks, and we observed the 75th and 95th percentile task times deviating a lot - and all of the slow-running tasks originated from a common instance.

The last behavior was stuck tasks. The executor has completed the processing and communicated back to the driver, but the driver thinks the executor is still working on those nodes, because the communication between the two is broken by the underperforming hardware. Is this a problem we can solve at the source? No - there is always a probability of getting hardware with known or unknown problems. So it is very important to build your application with resiliency and fault tolerance, so you can absorb those underperforming hosts and keep the processing moving.

We started implementing workarounds. From a resiliency standpoint, the first important one was choosing a proper instance type: we migrated our workloads to Graviton-based instances, and after the migration we saw reduced error rates, and with reduced error rates, better stability of overall job performance.

The second important workaround was the EMR upgrades. With the EMR upgrades we get a new version of Spark, where a lot of resiliency- and scalability-related features have been added and existing problems fixed, and those fixes solved the broken-communication problem between executor and driver. That helped us a lot.

Did we stop there? No. The third optimization was configuration tuning, both at the job level and at the cluster level. One of the important configurations that helped us solve the inconsistent task performance is speculation. Speculation is a feature in Apache Spark where the engine executes the same task on two different instances; whichever copy completes first is taken, and the slower one is discarded.

What this feature effectively let us do is remove the bad node - the slow, underperforming instance - from the overall processing, because the same task is processed on two different nodes, so we are no longer held up by the underperforming hardware.

The second configuration we looked into goes further: we didn't just want to run the task on two nodes, we also wanted to exclude the bad instance from the subsequent stages. We already know this stage was impacted - why use the same instance in the next stage? Spark has another feature for this, node exclusion (excludeOnFailure). There are a lot of heuristics you can play with, and we tuned the configuration based on the impact we wanted.
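
A hedged example of the kind of configuration involved, set here through the SparkSession builder. The thresholds are illustrative, and the spark.excludeOnFailure.* names apply to Spark 3.1+, where they replaced the older spark.blacklist.* settings.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- the right thresholds depend on the workload.
spark = (
    SparkSession.builder.appName("speculation-exclusion-sketch")
    # Re-launch suspiciously slow tasks on another executor and keep whichever
    # copy finishes first.
    .config("spark.speculation", "true")
    .config("spark.speculation.quantile", "0.9")    # wait for 90% of tasks to finish first
    .config("spark.speculation.multiplier", "3")    # a task 3x slower than the median is "slow"
    # Spark 3.1+ node/executor exclusion (formerly the "blacklist" feature):
    # stop scheduling on executors and hosts with repeated task failures.
    .config("spark.excludeOnFailure.enabled", "true")
    .config("spark.excludeOnFailure.stage.maxFailedTasksPerExecutor", "2")
    .config("spark.excludeOnFailure.application.maxFailedTasksPerExecutor", "4")
    .getOrCreate()
)
```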

With the help of that configuration, we were able to exclude nodes from processing for that particular Spark application. The last thing we built, in addition to optimizing the cluster and application configuration and using new instance types, is monitoring. We built custom monitoring capabilities that watch for these bad instances, and we automated the remediation using AWS SSM: while the processing continues in the background, the SSM agent shuts down the NodeManager and DataNode on the bad host, and EMR eventually removes that node from the cluster.
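
A hedged sketch of that kind of automated remediation using SSM Run Command. The instance ID is a placeholder, and the systemd unit names match recent EMR releases but should be verified for your release.

```python
import boto3

ssm = boto3.client("ssm")

def decommission_bad_node(instance_id):
    """Hypothetical remediation: stop the YARN NodeManager and HDFS DataNode on a
    node flagged as underperforming so the cluster stops scheduling work on it.
    Service names follow recent EMR releases (systemd units); verify for yours."""
    return ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": [
                "sudo systemctl stop hadoop-yarn-nodemanager",
                "sudo systemctl stop hadoop-hdfs-datanode",
            ]
        },
        Comment="Remove underperforming node from active processing",
    )

# Example: triggered by custom monitoring that detected fetch failures / stragglers.
decommission_bad_node("i-0123456789abcdef0")
```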

To understand the trend: around December and early January this year, we were impacted about 40 to 50 times a month. To put that in context, FINRA creates on average about 300,000 instances a day - roughly 5 to 6 million instances per month - so 40 or 50 bad instances out of 5 million is negligible in relative terms. But the impact on us is significant: every underperforming instance costs us about 15 minutes.

But after the upgrade, the Graviton migration, and fine-tuning speculation and the exclude-on-failure configuration, the impact has come down significantly. We are down to one or two occurrences a month - some months zero - and we have the capability to withstand one or two impacts a day, with resiliency sorted out on both the application and the infrastructure.

The next important challenge we encountered is scalability: handling burst request rates on an S3 prefix. S3 offers 3,500 PUT and 5,500 GET requests per second per prefix, and our S3 prefixes follow a traditional Hive partition model - the partition field and its value appear in the prefix - with the data always partitioned by either a processing date or a trade date.

We have multiple workloads accessing the same trade-date or processing-date level data sets, and because multiple parallel EMR and Spark workloads hit the same prefix, the traffic toward those prefixes grows - we need higher throughput for those prefixes at the trade-date or processing-date partition level.

Once we started saturating the throughput each prefix provides, we started seeing 503 Slow Down errors, and Slow Down errors impact not only performance but also job resiliency and stability - we started seeing job failures.

How did we solve this? First, we collaborated with the S3 service team to understand which prefixes were being throttled with 503s, and with their help we were able to repartition those prefixes in a specific way so that the throughput at the prefix level increases.

The next important feature, which we worked on with the EMR service team, is the additive-increase/multiplicative-decrease (AIMD) retry policy. By default, EMR comes with an exponential backoff retry policy. AIMD is the algorithm used for TCP congestion control: if there is bandwidth available at a particular prefix, EMR automatically increases the request rate, and the moment we start encountering 503s, it automatically decreases the request rate.

After moving some of our workloads to the AIMD retry policy, the 503s came down by roughly 75 to 80%, which is very significant for us - we were facing millions of 503s, and that reduction gave our jobs the stability they needed. And with the reduction in 503s, performance also improved to an extent for certain workloads.
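
On the EMR side, switching EMRFS to the AIMD retry strategy is a cluster configuration change; the sketch below shows the shape of it. The property name reflects our reading of the EMR documentation for recent 6.x releases and should be verified against your release.

```python
# Configuration classification to pass as `Configurations` in run_job_flow:
# switch EMRFS from the default exponential-backoff retries to AIMD
# (additive increase, multiplicative decrease) on recent EMR 6.x releases.
emrfs_aimd_configuration = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            # Ramp the S3 request rate up while there is headroom,
            # back off sharply when 503 Slow Down responses appear.
            "fs.s3.aimd.enabled": "true",
        },
    }
]
```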

We also looked at managing our workloads. One important area is partition pruning: when concurrent workloads access the same partition date, we partition the data even further down and see if we can reduce the amount of data being processed at the job level.

We also started optimizing file sizes. We standardized on 128 MB to 1 GB files in Parquet format, which reduced the number of S3 objects and in turn helped reduce the 503s. And the last optimization: with the help of the orchestrator we built, we schedule these applications in a way that reduces concurrent reads and writes at the prefix level.
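
A hedged PySpark sketch of those two ideas - sub-partitioning so concurrent jobs prune down to the slice they need, and bounding Parquet file sizes so there are fewer, larger S3 objects. Paths, columns, and the records-per-file cap are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-sizing-sketch").getOrCreate()
events = spark.read.parquet("s3://example-bucket/cat/linked/")   # placeholder input

# Sub-partition below trade_date so concurrent jobs prune to just the slice they
# need instead of all hammering the same trade-date prefix, and cap records per
# file so output lands in the 128 MB - 1 GB range (tune the cap to your row width).
(
    events.repartition("trade_date", "firm_id")
    .write.mode("overwrite")
    .partitionBy("trade_date", "firm_id")
    .option("maxRecordsPerFile", 5_000_000)
    .parquet("s3://example-bucket/cat/linked_by_firm/")
)

# Readers then prune partitions instead of scanning the whole trade date.
subset = (
    spark.read.parquet("s3://example-bucket/cat/linked_by_firm/")
    .filter("trade_date = '2023-11-27' AND firm_id = 'FIRM123'")
)
```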

The next scalability challenge we encountered is capacity between 8am and noon. We typically launch clusters ranging from 100 nodes up to 600 to 800 nodes, and we don't have a single cluster - we have multiple clusters running at the same time. We started encountering capacity problems, and with a four-hour SLA we don't have the bandwidth to be impacted by insufficient capacity exceptions. So capacity guarantees are very important for the overall processing.

How do we achieve capacity guarantees? We started leveraging on-demand capacity reservations (ODCRs) for all the critical workloads - we reserve instances using capacity reservations. The next problem, once we reserved the instances, was making sure those reservations are used by our critical workloads: we don't want a non-critical workload to hijack the reservation and leave the critical workload impacted again.

That's when the EMR team came back with support for targeted ODCRs. EMR now supports targeted on-demand capacity reservations, so you can target a specific workload to use those reservations. With targeted ODCRs we were able to manage the reservations effectively and keep them for the critical workloads. And on the weekends, when the processing is not happening, we suspend the reservations and resume them when we need them back.
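
A hedged sketch of the moving parts: create a targeted capacity reservation with EC2, then have the critical cluster's instance fleet prefer capacity reservations, with the targeted reservations grouped behind a resource group. All identifiers, counts, and the instance type are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# 1) Reserve capacity for the critical window. "targeted" means only launches that
#    explicitly reference the reservation (or its resource group) can consume it.
reservation = ec2.create_capacity_reservation(
    InstanceType="r6gd.4xlarge",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-1a",
    InstanceCount=600,
    InstanceMatchCriteria="targeted",
    EndDateType="unlimited",
)

# 2) Have the critical cluster's instance fleet prefer those reservations. The
#    targeted ODCRs are grouped behind a resource group (placeholder ARN); this
#    dict goes into InstanceFleets[n]["LaunchSpecifications"]["OnDemandSpecification"]
#    of the run_job_flow request.
on_demand_spec = {
    "AllocationStrategy": "lowest-price",
    "CapacityReservationOptions": {
        "UsageStrategy": "use-capacity-reservations-first",
        "CapacityReservationResourceGroupArn": (
            "arn:aws:resource-groups:us-east-1:111122223333:group/cat-odcr-group"
        ),
    },
}
```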

So we improved performance, scalability, and resiliency. We are processing huge volumes of data - the volume has increased and we are still able to process it. But as we add more clusters and improve overall application efficiency, we also want to make sure we keep cost under control.

One of the important cost optimizations that helped us is the migration to Graviton, which the EMR upgrade enabled. With EMR 5.27 we were not able to use Graviton, but once we migrated our workloads to 6.5 it opened up new instance possibilities, and Graviton is one of them. In addition to the performance gains, after migrating to Graviton for the intra-firm linkage process specifically, we saw about 60% better overall price performance.

We spoke a lot about critical workloads, but we also have workloads that run outside of the four-hour window. One of the common problems we started seeing in those workloads is Spot capacity. These workloads use instance fleets with one to five instance types specified in the fleet configuration; we try to use Spot for non-critical workloads, and if Spot capacity is not available after 30 minutes, we fall back to On-Demand.

With only one to five instance types in the fleet, we started seeing capacity problems with Spot, and because of that we eventually had to fall back to On-Demand, and our cost went up. And it wasn't only cost - even with On-Demand we sometimes ran into capacity problems, because the one to five instance types we were asking for were not available.

How did we solve that? Instance type diversification. Early this year, EMR went from supporting five instance types in a fleet to up to 30, and we changed our cluster definitions to include more instance types from different families as part of the cluster creation request - fleets now have 5 to 15 types, depending on the workload. We also use allocation strategy, a feature in EMR and EC2 that gets you capacity using capacity-optimized allocation for Spot and lowest-price allocation for On-Demand; it offloads the capacity decision from EMR to EC2 and finds capacity much more effectively.
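
A hedged instance-fleet sketch for a non-critical workload showing the diversification and allocation-strategy settings described above - several instance types across families, capacity-optimized Spot allocation, and a 30-minute fallback to On-Demand. The types, weights, and capacities are illustrative.

```python
# Diversified task fleet for a non-critical workload: several instance types across
# families, capacity-optimized Spot allocation, and a 30-minute fallback to On-Demand.
# Pass inside Instances["InstanceFleets"] of run_job_flow; types and weights are
# illustrative, and recent EMR releases allow up to 30 types per fleet.
task_fleet = {
    "Name": "task-fleet",
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 400,
    "TargetOnDemandCapacity": 0,
    "InstanceTypeConfigs": [
        {"InstanceType": "r6gd.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5d.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5dn.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "i3.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "m5d.8xlarge", "WeightedCapacity": 2},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "AllocationStrategy": "capacity-optimized",
            "TimeoutDurationMinutes": 30,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",
        },
        "OnDemandSpecification": {"AllocationStrategy": "lowest-price"},
    },
}
```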

After we started diversifying our instance types, we saw fewer cluster creation failures and our overall success rate improved. We are still working through how to get to a better combination of those 15 to 30 instance types, but the success rate improved significantly after we diversified. Earlier this year our SLA-met percentage was very low - around December we were at about 40%, and in January we didn't meet the SLA. With all these optimizations - the Graviton migration, the upgrades, and the capacity-related changes - our SLA achievement increased, and we are at 100% now, even with an 80 to 90% increase in volumes.

In January we were at about 100 to 200 billion records per day - and that's just for the intra-firm linkage process, not the entire exchange volume. Since then that volume has increased by two times, but we are still able to meet the SLA with all these optimizations.

Did we just improve performance? No - with Graviton adoption and further application optimizations, our compute hours came down by 50%. We were at roughly 90,000 to 100,000 compute hours on a daily basis; now we are at 40,000 to 50,000 compute hours, which means we are processing more efficiently than we were at the start of the year.

Did we stop? No - we are looking at new capabilities that could help us scale even further. We are evaluating EMR on EKS and Apache Iceberg for scalability and resiliency. With the success of Graviton, we are looking to adopt Graviton3 and EBS gp3 volumes for better performance, and we are looking at newer versions of Apache Spark, 3.2 and above, to use features like Magnet push-based shuffle. We are also looking at application modernization: the progressive architecture still processes multiple firms in a batch at a time, and we want to parallelize the processing by firm.

The key takeaway we want to call out is this: for any application with tight processing deadlines and a big scalability problem, it's important to make performance engineering part of your development life cycle, with focus not just on the application but also on infrastructure, scalability, and resiliency. With focus on all four of those areas, you should be able to build an application that can scale to higher volumes of data. Do you want to add anything?

Yes - when you are choosing an instance type, choose the right instance type for your workload. If you have a compute-bound problem, use compute-optimized instance types; if you have a Spark application that uses a lot of memory and does large shuffles, go for NVMe-based instance types. Have a regular cadence for upgrading your software, say every six months or a year. And design for scalability - design your application so that it can scale to any new peak volumes you see. Thanks.

Thanks, everyone, for your time. I hope we were able to provide some value; we will be available offline if you have any questions. Thank you.
