February 9, 2021

Scale out workload using Azure HDInsight

Filed under: architecture,cloud Himanshu @ 12:47 pm


Scale is one of the reasons organizations move to cloud computing, regardless of their size (enterprise or startup) and whether they need to scale up or scale out. For the uninitiated, simply put, scaling up means adding more power to a single machine, while scaling out means distributing the computing workload across multiple machines. Scaling out makes it possible to reach a level of scale that scaling up cannot, giving you virtually unlimited capacity.

Azure's arsenal of services makes a wide variety of architectural patterns possible for scaling out large workloads. Azure Functions, App Services, Azure Kubernetes Service, HDInsight clusters, Azure Synapse, and Virtual Machine Scale Sets are the most commonly used Azure services for scaling out, depending on the kind of workload. We recently solved a large data processing workload using Azure HDInsight.

We used Azure HDInsight for a healthcare client that has datasets with multiple years of anonymized patient treatment data spanning several entire health systems. The processing itself was complex and entailed many transformations and computations. The end result was to submit the analyzed information to CMS (Centers for Medicare & Medicaid Services) in a condensed and auditable manner. We designed our solution exactly for this high-demand situation.

We used Apache Spark on an Azure HDInsight cluster. Azure HDInsight is a platform that makes it easy to create clusters of computers preconfigured with open source frameworks like Apache Spark, Apache HBase, and Apache Kafka. Since we used Apache Spark, we loaded the processing code and raw data into the cluster to do the big data processing. We spawned a cluster of about 40 computers working together. Effectively, this meant we ‘created’ one huge computer with about 2.5 terabytes of RAM (that’s right, not a typo: terabytes, not gigabytes, and RAM, not disk storage!), about 325 processor cores equating to about 650 virtual cores, and about 16 terabytes of disk space, all working together on a single large workload. With this design and scale, we could complete the workload in about 8 hours; it would otherwise have taken about 2-3 weeks. Finally, since we only needed this temporarily, we could tear the entire infrastructure down after we were done, keeping our infrastructure cost optimal.
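
To give a feel for the shape of such a job, here is a minimal PySpark sketch of the kind of batch processing that runs on an HDInsight Spark cluster. The storage paths, column names, and aggregation logic are illustrative assumptions, not the client's actual pipeline.

```python
# Minimal PySpark sketch of a batch job on an HDInsight Spark cluster.
# Paths, schema, and business rules below are illustrative assumptions only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("patient-treatment-aggregation")  # hypothetical job name
    .getOrCreate()
)

# Raw, anonymized treatment records staged in cluster-attached storage
# (e.g. Azure Blob Storage reachable via wasbs://) - path is hypothetical.
raw = spark.read.parquet(
    "wasbs://data@storageaccount.blob.core.windows.net/raw/treatments/"
)

# Example transformation: normalize dates and derive a treatment-year column.
treatments = (
    raw.withColumn("treatment_date", F.to_date("treatment_date"))
       .withColumn("treatment_year", F.year("treatment_date"))
)

# Example computation: condense per-patient, per-year measures for submission.
summary = (
    treatments.groupBy("patient_id", "treatment_year", "measure_code")
              .agg(
                  F.count("*").alias("encounter_count"),
                  F.sum("allowed_amount").alias("total_allowed_amount"),
              )
)

# Write the condensed, auditable output back to storage for downstream submission.
summary.write.mode("overwrite").parquet(
    "wasbs://data@storageaccount.blob.core.windows.net/curated/cms_summary/"
)

spark.stop()
```

Because the job is ordinary Spark code, it can be rerun unchanged on a larger or smaller cluster, which is what makes the create-process-tear-down pattern described above practical.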

It may sound cliched, but there is a lot of truth to the statement “Cloud computing is a big game changer”! There are several situations where having access to large computing resources can be a great competitive advantage. Thanks to AWS, Azure, and other cloud providers, this can be done at a fraction of the capex it once required; your software architecture, however, needs to support it.
