Guest post by Jigar Bhalodia, Data Infrastructure Engineer, Hanyu Cui, Sr. Software Engineer, Data Infrastructure, Stephen Goeppele-Parrish, Cloud Infrastructure Engineer, and Chris Han, Sr. Engineering Manager, Komodo Health
At Komodo Health, we have been growing rapidly in our mission to reduce the global burden of disease by building software and data products based on a foundation of health data. Our Healthcare Map™ captures the de-identified experiences of more than 320 million Americans as they move through the healthcare system. We continuously add data sources and clinical encounters to ensure our data is the most current, complete, and connected.
Our company has experienced 100% YoY growth over the past several years (our engineering team has grown from 35 to over 90 engineers and data scientists in the past 18 months). While this growth is exciting, it also presents a number of interesting scalability challenges – one of those being how we run Spark internally for ad-hoc analytics and batch ETL jobs.
We have been running standalone Spark clusters on Kubernetes nodes for four years, for both production ETL pipelines and ad-hoc analytics. For analytics, Jupyter notebooks are a popular tool of choice among our data scientists and data engineers to interface with Spark for distributed computation. For batch ETL jobs, we utilize both Spark and Airflow on Kubernetes. Each engineer or scientist, with the help of a lot of automation, received their own Spark cluster, which they could tune and modify to their needs. Each cluster was self-managed in AWS with its own dedicated ASG and EC2 instances.
The Spark architecture worked great for our earliest data scientists and data engineers who were hands-on with the infrastructure, and gave them a lot of flexibility to configure the cluster as needed. As we grew, we needed to evolve our infrastructure to reduce costs, scale access, improve engineering productivity, and improve resource efficiency.
Vision and Goals
Our vision was to address each of these problems by building a platform on AWS's Elastic MapReduce (EMR) 6 which would:
- Improve onboarding times by providing engineers and data scientists access to Jupyter notebooks, JupyterLab, and Spark compute on Day 1.
- Abstract away the complexities of infrastructure and reduce overall support burden on data scientists and engineers.
- Drive down EC2 costs by autoscaling and optimizing resource utilization (such as using spot instances).
- Support multi-tenancy with containerization to avoid over-provisioning compute.
- Improve overall cluster stability and enable blue-green deployments.
We implemented a Spark platform with JupyterHub as the frontend and EMR 6 on the backend. This supports existing workflows without burdening developers with the challenges associated with managing Spark clusters. Data scientists and engineers are now able to focus on what they do best, and the infrastructure team takes care of the rest. The whole solution consists of a frontend Kubernetes-managed JupyterHub server that serves notebooks, a backend EMR 6 cluster that runs Spark, and a whole set of processes and CI/CD pipelines that make sure we consistently deliver high-quality features without causing regressions.
Our specific approach to architecture and design has been critical to the success of our solution.
Platform Architecture with EKS and EMR 6
JupyterHub, the open-source, multi-user Jupyter notebook environment, makes it easier for groups of users to work with Jupyter notebooks. It allows us to present a web page that our users can visit to quickly get started with Jupyter notebook servers. Users do not need access to, or knowledge of, AWS and Kubernetes; all that's required is their single sign-on credentials for authentication.
We run JupyterHub on an EKS cluster deploying the open-source Helm charts provided by Jupyter. When a user visits our JupyterHub webpage, JupyterHub will authenticate via Okta and create a notebook server pod for that user via a proxy, which is provided out-of-the-box. The JupyterHub notebook servers work the same as regular notebook servers and users can select Python kernels for plain Python code or PySpark kernels and Sparkmagic to access EMR 6 clusters.
We also support JupyterLab, a next generation user interface for Jupyter notebooks. JupyterLab allows you to arrange multiple panes in the same interface which enhances productivity. The panes can include one or more instances of notebooks, text files, terminals, or consoles. By supporting both traditional notebooks and JupyterLab notebooks, engineers and scientists can choose which interface they prefer.
The Infrastructure team manages all aspects of JupyterHub and its ancillary components including Sparkmagic, Livy, and multi-tenant EMR Spark clusters. The one exception is that some Sparkmagic configuration settings can be changed by users. These settings allow our users to both specify unique Spark configurations and to define their own Docker images for their Spark application running on the multi-tenant EMR cluster.
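For example, a user can override per-session Spark settings from a notebook cell through Sparkmagic's `%%configure` cell magic. A minimal sketch of such an override, expressed as the JSON body the magic accepts (the memory and partition values here are illustrative examples, not our defaults):

```python
# Illustrative per-session override a user might pass to Sparkmagic's
# `%%configure -f` cell magic; all values are examples, not Komodo's
# actual defaults.
session_overrides = {
    "driverMemory": "4g",
    "executorMemory": "8g",
    "executorCores": 2,
    "conf": {
        # Arbitrary Spark properties can be forwarded to the EMR cluster.
        "spark.sql.shuffle.partitions": "400",
    },
}
```

In a notebook, this dictionary would be written as the JSON body of a `%%configure -f` cell before the Spark session is created, so the session starts with the requested resources.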
Supporting both ad-hoc analytics and batch ETL jobs
Our data scientists and data engineers spend a significant amount of time performing ad-hoc analytics. The work is mostly interactive, requires quick turnaround, and a user is often unable to predict how many resources will be needed. Within two minutes, our users have access to a notebook and an automatically generated Spark session ready to run code on the EMR cluster. Once logged in, the session stays live for the day while a user runs their code. We use Sparkmagic inside a Jupyter notebook to provide seamless integration of notebooks and PySpark. We forked Sparkmagic to meet our unique security and deployment needs.
ETL jobs are also a big part of our Spark workloads. We use Airflow for orchestration, and pipelines are able to submit Spark jobs to the EMR cluster for the distributed computation. We created a small, in-house Python package to support submitting batch jobs directly to the Livy endpoint of the EMR cluster. While not utilizing our managed notebooks, Airflow jobs also benefit from a managed EMR Spark cluster which alleviates maintenance and operational burden.
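Our package wraps Livy's REST batch API; a minimal sketch of what such a submission can look like is below. The endpoint host and S3 path are hypothetical placeholders, and the real package layers on our security and deployment specifics.

```python
import json

# Hypothetical Livy endpoint on the EMR master node (default port 8998).
LIVY_ENDPOINT = "http://emr-master.example.internal:8998"

def build_batch_payload(file, args=None, conf=None):
    """Build the JSON body for Livy's POST /batches API."""
    payload = {"file": file}
    if args:
        payload["args"] = args
    if conf:
        payload["conf"] = conf
    return payload

def submit_batch(payload):
    """POST the batch to Livy; the response carries the batch id and state."""
    import requests  # third-party dependency, kept local to the network call
    resp = requests.post(
        f"{LIVY_ENDPOINT}/batches",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    return resp.json()

# Example payload an Airflow task might build (paths/args are placeholders).
payload = build_batch_payload(
    "s3://example-bucket/jobs/daily_etl.py",
    args=["--run-date", "2020-06-01"],
    conf={"spark.dynamicAllocation.enabled": "true"},
)
```

An Airflow task then only needs to build the payload, call `submit_batch`, and poll the returned batch id until the job reaches a terminal state.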
Autoscaling EMR with instance groups and YARN node labels
As an infrastructure team, we primarily serve internal stakeholders consisting of over 90 full-stack engineers, data scientists, and data engineers. We wanted our platform to be multi-tenant, supporting a wide variety of users and use cases. Since the number of users and workloads are variable, reliably scaling up and down according to demand while remaining cost efficient was a high priority. We achieved this with instance groups and YARN node labels.
Being able to rapidly scale nodes up and down based on usage, at large scale, allows us to provide the required resources within 10 to 15 minutes. With autoscaling rules, it is vital to scale based on the YARN metrics appropriate for your needs and to ensure there is no capacity "ping-ponging," where a scale-up event triggers a subsequent scale-down event (or vice versa). Ping-ponging leaves the cluster constantly scaling up and down, causing job failures and cluster instability.
Autoscaling rules have a direct impact on user experience, and there is a constant tension between improving user experience and managing AWS infrastructure costs. On one hand, users need to quickly scale up resources to run their jobs; on the other hand, keeping idle resources around to improve user experience is expensive. There is no one right answer when it comes to autoscaling rules; they need to be iteratively adjusted to find the right balance between cost and user experience.
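As a concrete illustration, one scale-out rule in an EMR instance-group automatic scaling policy, keyed on available YARN memory, might look like the following. The thresholds, cooldowns, and adjustment sizes are illustrative rather than our production values; spacing the scale-out and scale-in thresholds well apart is one guard against ping-ponging.

```python
# Sketch of a single rule in an EMR automatic scaling policy, in the
# shape accepted by boto3's emr.put_auto_scaling_policy. All numbers
# are illustrative examples, not Komodo's production settings.
scale_out_rule = {
    "Name": "ScaleOutOnLowAvailableYarnMemory",
    "Description": "Add capacity when YARN memory headroom gets thin",
    "Action": {
        "SimpleScalingPolicyConfiguration": {
            "AdjustmentType": "CHANGE_IN_CAPACITY",
            "ScalingAdjustment": 2,   # add two instances per trigger
            "CoolDown": 600,          # seconds before the rule can fire again
        }
    },
    "Trigger": {
        "CloudWatchAlarmDefinition": {
            "MetricName": "YARNMemoryAvailablePercentage",
            "Namespace": "AWS/ElasticMapReduce",
            "ComparisonOperator": "LESS_THAN",
            "Threshold": 15.0,        # scale out below 15% free YARN memory
            "Period": 300,
            "EvaluationPeriods": 1,
            "Statistic": "AVERAGE",
            "Unit": "PERCENT",
        }
    },
}
```

A matching scale-in rule would typically trigger only at a much higher available-memory threshold, with a longer cooldown, so that one rule's action does not immediately trip the other.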
Efficient Resource Utilization with Spark Dynamic Allocation
Having Spark dynamic allocation enabled was key for us, and is why we chose EMR and the YARN resource manager. EMR has dynamic allocation enabled by default, which efficiently allocates resources to a Spark application. With ad-hoc jobs, it is difficult to predict the resource utilization of a notebook, and it's not feasible to allocate resources at a cell level. With static resource allocation, a fixed amount of resources is allocated to the notebook for as long as the notebook is running, regardless of its computation needs, or whether it's busy or idle. With dynamic allocation enabled, Spark determines the appropriate amount of resources to allocate to each cell in the notebook based on how many tasks are created. If the notebook is running but idle, only the driver is allocated. Users can hold on to executors while a notebook is idle by caching DataFrames; however, we can control how long a data block may stay cached on an idle executor before that executor is removed.
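The relevant knobs are standard Spark properties; a sketch with illustrative values (recall that EMR enables dynamic allocation by default, so only the timeouts typically need tuning):

```python
# Illustrative Spark dynamic-allocation settings; the timeout values
# are example choices, not Komodo's production configuration.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Executors with no running tasks are released after this idle period.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Even executors holding cached data are reclaimed after this period,
    # bounding how long cached DataFrames can pin idle executors.
    "spark.dynamicAllocation.cachedExecutorIdleTimeout": "30m",
}
```

The `cachedExecutorIdleTimeout` setting is the lever mentioned above: without it, an executor holding a cached DataFrame is never released by dynamic allocation.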
Multi-tenancy and Dependency Management with Containers
We leveraged the support for Docker containerization in EMR 6 as the foundation for our multi-tenancy Spark cluster. Without containerization, dependency management becomes challenging since users could require conflicting versions on a single cluster. Users can define their own Docker images with their individual dependencies, and specify driver and executor images for each Spark application running on the cluster. We pushed EMR usage into multi-tenancy use cases. We have worked closely with the AWS EMR service team and support engineers since the first beta release to help them identify issues and receive guidance on proper configurations and workarounds.
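On EMR 6, a user can point both the driver-side YARN application master and the executors at their own image through Spark properties along these lines (the image URI below is a hypothetical placeholder):

```python
# Sketch of EMR 6's YARN Docker runtime settings for one Spark
# application; "example-registry/team-pyspark:1.0" is a hypothetical
# image URI, not a real Komodo image.
image = "example-registry/team-pyspark:1.0"

docker_conf = {
    "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
    "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
    "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
}
```

Because the settings are per application, two users with conflicting Python dependencies can run side by side on the same multi-tenant cluster, each inside their own image.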
Log Shipping Architecture
Debugging Spark applications via EMR logs in S3 is another challenge we solved. We decided to use Logstash to pull logs from S3, filter fields, and push them to AWS Elasticsearch. With proper index management and log-retention configuration, Elasticsearch makes logs easily searchable and debugging applications much easier.
Observations, Insights, & Lessons Learned
During the 5 months we spent building this platform, we learned a number of lessons and observed several interesting use cases:
- Even for an internal platform, UX matters. Because of our new, user-friendly platform, users in functional areas beyond engineering are now able to use notebooks.
- With notebooks easily accessible, we began seeing more unexpected use cases, such as training engineers with notebooks.
- During the architecture phases, we anticipated needing to place limits or quotas on backend resources, since they were much easier to spin up. Although we've had some heavy users, this risk has not materialized.
- Anticipate additional needs for tooling beyond the primary goal. We also invested in tracking usage by user, which was helpful in breaking out hosting costs from the multi-tenant environment. Per user tracking can be rolled up in a team view for chargebacks.
- Communicate and set expectations that there is a stabilization period after a new product launches. After we launched, we knew the platform covered most use cases, but did not cover 100%. Anticipate a burst of support requests once people start using the platform and be prepared to operationalize and continue to improve it. Our team facilitated multiple, company-wide demos and training sessions to educate users.
- Contributions from the internal engineering team increased, which improved our understanding of user needs. Since launch, we have added the Glue Catalog, with many more enhancements on the roadmap.
- EMR instance groups are AZ-specific to minimize latency. We relied on instance groups and YARN node labels to use different instance types across AZs.
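Node-label steering like the above is expressed through Spark-on-YARN properties; a hedged sketch (the label names are hypothetical, not the labels our cluster defines):

```python
# Illustrative use of YARN node labels to steer Spark components onto
# specific instance groups; "ON_DEMAND" and "SPOT" are hypothetical
# label names, not Komodo's actual cluster labels.
node_label_conf = {
    # Keep the application master on stable on-demand capacity...
    "spark.yarn.am.nodeLabelExpression": "ON_DEMAND",
    # ...while executors may land on cheaper spot capacity.
    "spark.yarn.executor.nodeLabelExpression": "SPOT",
}
```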
Leveraging Spark on EMR strikes the right balance between infrastructure abstraction and management, while still giving us the coarse-grained configuration needed to ensure our data scientists and engineers have the tools they need. Deployments also become much easier, since we can leverage other AWS services such as Route 53 and ALBs.
Overall, we’ve seen great reception from our engineering team. Jupyter notebooks and Spark are available with minimal friction. EMR 6 and other AWS services were key enablers of the platform’s success.
If you’re interested in working on the platform and are an infrastructure lover, we’re hiring.