Dale Rensing – Connect Worldwide | Your Independent HPE Technology User Community

HPE Developer Community Meetup: Machine Learning Data Version Control (DVC): Reproducibility and Collaboration in your ML Projects
Published September 23, 2022

Machine Learning Data Version Control (DVC): Reproducibility and Collaboration in your ML Projects

September 28, 2022
10:00 AM Central Time (US and Canada)

In this session, we’ll do a demo showing how to manage your machine learning projects and make them reproducible with the open-source tool DVC and the DVC extension for the VS Code IDE. We’ll see how to track datasets and models, and how to run, compare, visualize, and track machine learning experiments right in VS Code. We’ll then go over the GitOps-based model registry solution, which lets you search, share, and manage all your models with full context around model lineage, version, production status, the data used to train the model, and more.

 

Speaker: Alex Kim, Solutions Engineer, Iterative

Alex Kim is a Solutions Engineer at Iterative. His background is in physics, software engineering, and machine learning. In the last couple of years, he became increasingly interested in the engineering side of ML projects: processes and tools needed to go from an idea to a production solution.

Developers: Get free resources and training through the HPE Developer Community portal
Published September 22, 2022

The HPE Developer Community portal is your door to learning more!

The variety of ways that developers, data scientists, and ML engineers can leverage APIs and data to create new business opportunities is boundless. But if it’s too hard, it’s not worth the struggle, right? That’s why Hewlett Packard Enterprise makes it easy to access the resources you need on the HPE Developer Community portal. Being a huge supporter of open collaboration, HPE opens the door to provide direct access to software engineering experts, APIs, documentation, tutorials, and training – all free and at your disposal anytime from anywhere.

Explore HPE technologies

On the portal you’ll find all things software at HPE. We’ll introduce you to our different software technologies and provide access to different projects, APIs, and GitHub repositories. You’ll get to preview some of the key open source projects we’re involved with. And you’ll learn more about the unified HPE GreenLake experience and how you can take advantage of a well-documented, secure, and scalable framework of APIs offered through the HPE GreenLake Developer portal.

If you’re not sure which technologies best fit your needs, you can check out our persona section where we help you navigate the HPE Developer Community portal according to your job role. On each page you’ll find articles exploring industry-leading methodologies, best practices to set your foundation in place, and solutions to address your most pressing concerns. You’ll also be directed to training webinars, hands-on workshops, and tutorials specific to your needs, as well as Slack channels where you can get direct answers to your questions.

Take advantage of all of our free training

The HPE Developer Community offers a variety of ways to skill up – and they are all free. All you need to do is register so we can make the appropriate resources available to you.

 

Technology Talks

We offer two types of webinar-style training sessions on a monthly basis – the Munch & Learn Technology Talks and the HPE Developer Meetups. Both the Munch & Learn and the Meetup sessions are hour-long discussions where you can ask questions of the presenters in a virtual meeting room.

The Munch & Learn Technology Talks feature renowned industry technologists sharing thought-leadership insights on popular HPE and open source technologies. They tend to be held the third Wednesday of every month. The HPE Developer Meetups give you an opportunity to connect with other experts and dive in deep to learn more about some of our most exciting technologies. These meetings are generally held the last Wednesday of the month.

And don’t worry … if you happen to miss one of these sessions, you can always catch it on one of our replays posted on the portal. These can be found on the same schedule pages linked above.

 

On-demand courses

If you’re interested in trying out some of our technologies, take advantage of our unique, hands-on technical training experience – the HPE Developer Workshops-on-Demand. Using Jupyter Notebooks, you’ll get to familiarize yourself with different ways to use our products and see how they work. Accompanying videos make these workshops a breeze to work through. And they’re fun. Collect a participation badge for each workshop you complete and level up from apprentice to super hero to become a legend! You will find a wide variety of over 2 dozen workshops in the catalog and more keep coming, so make sure you check back frequently!

If you’re interested in more traditional Learn On-Demand courses, check out our link to the HPE Ezmeral Learn On-Demand catalog. Here you’ll find over 50 courses on Big Data, AI/ML, zero trust and data security, Apache Spark & SQL Analytics, Kubernetes, and the HPE Ezmeral Data Fabric.

Get and stay informed

One of the key resources found on the HPE Developer Community portal is our blog. Here you’ll find hundreds of articles and tutorials written by software experts. Many of our blog writers are HPE subject matter experts (SMEs), but quite a few posts are written by community members who use the products. Many of these articles are step-by-step tutorials that give you tips on how to improve your efficiency in working with a product or how to do something special.

Make sure to subscribe to the HPE Developer newsletter. This is the best way to learn about the newest posts to our blog, as well as any new workshops, technology talks, or videos that have just been released. This monthly publication gets delivered to your mailbox the first week of each month and is chock full of informative material.

Join us

The HPE Developer Community is truly just that – a community of software developers, data scientists, data/ML engineers, and IT Ops specialists collaborating to accelerate innovation. Our portal facilitates this by offering valuable resources for free. We keep connected through the HPE Developer Slack channel, Twitter, participation in certain events, and the monthly newsletter. We’re dedicated to helping you achieve your goals of leveraging APIs and data to create new and wonderful business opportunities. Connect with us today.

Dale Rensing is an HPE Developer Community Communications Manager. She has written for the tech community for many years. Dale has worked with the HPE Developer Community since 2019, helping create and promote the many assets HPE offers to developers, data scientists, and ML engineers worldwide. As their editor and communications manager, she connects with members of the HPE Developer Community to facilitate collaboration and engagement.

HPE Developer Community Munch & Learn: Accelerate public sector AI use cases using a powerful ML Ops platform
Published September 7, 2022

Accelerate public sector AI use cases using a powerful ML Ops platform

Sep 21, 2022 
08:00 AM Pacific Time

 

The French governmental agency, Pôle Emploi, is collaborating with HPE in an effort to improve its ability to provide social benefits to the unemployed and help them find work. They have built an effective and scalable AI/ML Ops platform running on HPE Ezmeral Runtime Enterprise that helps companies find and hire workers. Following HPE best practices and using purpose-built platforms, they aim to improve productivity across the many teams involved in implementing an ML Ops pipeline, including Big Data teams, business owners, data scientists, and IT Operations. In this talk, François and Dietrich will explore different use cases where this AI/ML Ops platform has been put into action.

Speakers

François Réthoré
Software Engineer, Pôle Emploi

Dietrich Zinsou
Software Engineer, HPE

It’s All Fun and Games at the Hack Shack!
Published April 22, 2022

Can you believe it? Hewlett Packard Enterprise (HPE) will be holding its first major in-person conference since 2019 in just over a month! HPE Discover 2022, the edge-to-cloud conference, will be held in Las Vegas at the Venetian/Palazzo Resort from June 28-30th. The HPE Developer Community team is really excited to be able to engage with attendees in person once again. Make sure you register soon so you don’t miss out!

 

 

A place designed especially for you

 

HPE Discover 2022 attendees will have plenty of opportunity to sit and hear about all the newest HPE solutions and technologies throughout the event. But once you enter the Hack Shack, things get personal! This is where developers, data scientists, and IT technologists have the opportunity to sit down with experts and discuss topics that are important to them. Experts on hot topics like zero-trust security, data lakes, machine learning and analytics workloads, and Open Source projects will be hanging around just so you can get your questions answered.

 

Within the Hack Shack, you’ll engage with members of the HPE Developer Community in topical meetups. These 30-minute sessions are informal gatherings where you can listen to various subject matter experts (SMEs) and then engage in further discussion on the topic as a group. SMEs will also hang out for a bit after each session in case you want to delve into further detail regarding your own specific needs. (Stay tuned to our next blog post where we’ll get into details on these meetup sessions.)

 

There will also be an opportunity for you to sit down with the HPE UX design team to help them improve the user experience across HPE products and services. Share your opinions and make your voice heard by participating in a variety of user research activities that will inform the design and direction of HPE products and services.

 

Compete in games and win prizes!

 

While many come to the Hack Shack to connect with subject matter experts and get their questions answered, many also come to enjoy a little competitive fun with their colleagues. We’ll have a foosball table out on the front lawn, a Jenga and cornhole game in the backyard, chairs to relax in, and a Shield console to play some cool video games. And even though what happens in Vegas typically stays in Vegas, in this case players will be able to return home well-rested with some swag to call their own!

 

In addition to these popular games, the HPE Developer team has developed a couple of additional diversions for you. The first one is our Virtual Treasure Hunt. Designed to familiarize players with the HPE Developer Community web portal, this scavenger-hunt style challenge encourages you to check out all the resources that are available on the site, including the virtual Hack Shack. Be one of the first 12 to successfully complete the challenge and win a $45 gift card!

 

We have also designed six role-based hands-on challenges you might want to take. Each onsite participant has the opportunity to go home with some nifty swag. Six lucky on-site participants will walk away with a CanaKit Raspberry Pi 4 Extreme kit after having successfully completed one of the challenges and correctly answered the associated quiz. Grand prize winners will be announced at our Hack Shack celebration on Wednesday night. Just a quick note: You must be present to win.

 

Are you ready to take these challenges?

  • Cloud Architect Challenge – VM Desired State Management in HPE GreenLake – Your mission is to deploy a VM in HPE GreenLake using the open source infrastructure-as-code tool, Terraform. For this challenge, you’ll describe the desired state of your environment and use Terraform to analyze and build the necessary infrastructure artifacts. You’ll be provided with everything you need in a nice and friendly Jupyter Notebook environment with little to no code to write. Take an hour to experience Infrastructure-as-Code on HPE GreenLake and earn a chance to win a prize.
  • Open Source Advocate Challenge – Play with Python or Discover Ansible – Your Choice! – Feeling competitive? Expand your Open Source skills with a chance to win cool prizes. In this challenge, you will be tasked with completing one of two popular HPE DEV Workshops-on-Demand. Choose from Python 101 – A simple introduction to Python programming language or Ansible 101 – Introduction to Ansible concepts and respond correctly to the quiz for a chance to win a prize. An hour should be enough for you to complete this challenge!
  • Developer Challenge – Building Modern Software with Zero Trust Security – In today’s highly distributed modern software environments, security is a major concern. Who do you trust? In this challenge, choose one of two workshops, SPIFFE – SPIRE 101 – An introduction to SPIFFE server and SPIRE agent security concepts or Creating a Zero Trust Model for Microservices Architectures with SPIRE and Envoy, to understand, in less than an hour, how open source projects SPIFFE and SPIRE enable zero trust security at the heart of your solution and compete for a prize.
  • ML Engineer Challenge – Deep Learning Model Training at Scale with Determined – Deep learning at scale is difficult, right? Explore the fundamentals of Determined, the open-source deep learning training platform, to learn how it can help. Take this challenge and respond correctly to a quiz to try and win a cool prize. In this challenge, you will train a TensorFlow model in Determined using one GPU, and scale up your training across multiple GPUs, using distributed training, while finding accurate models faster using state-of-the-art hyperparameter search methods.
  • Data Scientist Challenge – Finding the Data You Didn’t Know You Needed – In this challenge, you’ll get to see how Dataspaces can help you discover new and meaningful datasets that enhance your model building experience, all whilst keeping track of the datasets you know and love, so next time you don’t have to go digging through old notebooks to find them! You’ll even learn how you can share them with your classmates or trade them for valuable tokens! This challenge should take about an hour and gives you a chance to win an awesome prize.
  • Data Driven Developer Challenge – Build a house on a Lake! – Know little to nothing about Apache Spark? Maybe you’ve heard of the terms Data Lake and Data Warehouse? Yes or no, you’re perfectly prepared to step up, put on your Data Driven Developer hat and create your first Lakehouse architecture! Ingest data from a Data Lake and convert it to Apache Spark Delta Lake format. Then perform SQL queries on it, create streams of data, and verify ACID compliance! Within an hour you’ll be ready to build a house on a lake and qualify to win a cool prize.

 

Come party with us

 

One of the most anticipated events at HPE Discover is the Hack Shack celebration. Planned for Wednesday evening, we’ll be serving refreshments and hosting a very special speaker who will be presenting the CanaKit Raspberry Pi sets to our lucky winners of the role-based challenges. 

 

Stay tuned! There’s more information to come! We’ll be publishing more details on the meetup sessions and the schedule in an upcoming HPE Developer Blog post.

 

Coding styles: A personal preference or bad practice?

Vinicius Monteiro
October 22, 2021

We all have different styles and preferences in everything in life, including how we write code.

Imprinting your personality in the code brings originality and a sense of ownership and responsibility. It’s essential to keep us motivated, and it makes us feel good (at least it does for me). However, is one’s coding style always just a harmless style? Or does it impact readability and hence maintenance?

This has been on my mind a lot lately. For instance, during a code review, I often question whether I should bring specific ways of coding into the discussion or not. How does it affect the application? Is it readable? Is it easy to maintain?

Or perhaps I should leave it alone, thinking to myself — Don’t be picky, it’s just their preference, it’s not a matter of right or wrong.

Identifying a programmer’s fingerprint

We could say a developer has a coding identity or ‘fingerprint’, similar to what happens with regular writing. When writing, there is often a pattern with which someone writes — the terms, vocabulary, structure. A linguistic expert, for instance, can identify the author of some anonymous material simply by analyzing these patterns.

Analyzing these patterns can even reveal things such as the age and place of birth of the author. This technique is called Stylometry. It’s even used in criminal investigations. Machine learning algorithms are used for Stylometry as well — as they can process many texts/books and identify patterns.

We probably can’t tell who committed a crime based on the coding style (can we?). But, let’s say in a team of ten developers, if there are no strict standards to follow, I believe it’s possible to identify who wrote a code block without looking at the author information.

In this post, I’ll list a number of different ways of writing code I’ve encountered throughout my career as a Software Engineer. I’ll focus mostly on Java, but some things are applicable in general.

I’ll also offer my perspective on whether it is just a coding preference that we shouldn’t care about, or if perhaps there is a right (and wrong) way of doing it.

Multiple or single “returns”

 

One coding practice that tends to reflect a developer’s preference is the use of a single or multiple ‘returns’.

I used to prefer a single ‘return’ at the end of the method, and I still do this sometimes. But more recently, I find that I tend to return as soon as the condition is satisfied — I think it’s easier to maintain (it looks uglier, though). You’re more sure of when the method returns a particular value, and you can be certain that any code after the return won’t be executed.
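For illustration, here is a minimal sketch of the two styles (the method names and values are made up purely for this example):

    // Single return: one exit point at the end, with the result carried in a variable
    static String classifySingleReturn(int count) {
        String result;
        if (count > 10) {
            result = "HIGH";
        } else {
            result = "LOW";
        }
        return result;
    }

    // Multiple returns: return as soon as the answer is known
    static String classifyEarlyReturn(int count) {
        if (count > 10) {
            return "HIGH";
        }
        return "LOW";
    }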

Otherwise, you need to read every if-else or break inside a loop. Often the logic is not as simple as the one presented above.

If I see some complex logic with multiple if-else conditions chained together, mixed with ‘break’ inside loops, etc., and one single return at the end, when a particular value could’ve been returned before — I’d explain my perspective and see if the person agrees with doing the change. However, I wouldn’t push it too much and be picky about it. It’s a subtle benefit that may be hard to convey.

To Else or not?

Another variation I tend to see is whether the coder uses the Else statement. Is it really necessary? I commonly use the “default value with no else” approach (the second snippet below) when it’s a simple variable assignment case. It just feels cleaner to me.
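As a rough sketch, here are the two alternatives (booleanFlag and the values "A" and "B" are placeholders, assumed to be defined elsewhere):

    // With else: start with null; exactly one assignment runs
    String value = null;
    if (booleanFlag) {
        value = "A";
    } else {
        value = "B";
    }

    // Default value with no else: up to two assignments run (both, if booleanFlag is true)
    String value = "B";
    if (booleanFlag) {
        value = "A";
    }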

A counter-argument could be that the first snippet uses fewer resources: you start with null, and only one value (A or B) is ever assigned. In the second, a maximum of two variable assignments could happen (if booleanFlag is true). I’d agree with that, but not for all cases. Setting a default first would be fine. It depends on what is being executed as the ‘default’.

This example was one ‘challenge’ that my Bachelor course coordinator threw at us newbies in the first semester during a programming class — “How could you rewrite the first version in fewer lines?!”

No one in the class could answer it. Everyone was still coming to terms with the fact that the course wasn’t really about learning Microsoft Office.

Although I prefer the second version (for a simple variable assignment), I’d probably not bring it up to discuss or ask to change in a code review.

Curly braces or not

Curly braces are used to delimit the start and end of a block of code. Curly braces become ‘optional’ when only one statement is inside an IF condition, a While or For loop.
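Here is a small sketch of what that looks like (count is just an illustrative variable):

    // With curly braces: the block is explicit
    if (count > 10) {
        System.out.println("count is large");
    }

    // Without curly braces: legal when the body is a single statement
    if (count > 10)
        System.out.println("count is large");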

Both code snippets do the same thing; there is no difference functionality-wise. Which one do you prefer?

For me, I’m totally in favour of using curly braces, always. It shouldn’t be optional. I think that’s mainly because, in languages like Java, the indentation doesn’t drive what will be executed as part of the if condition or loop (in Python, for example, it does). Indentation only, without curly braces, cannot be relied on — it may trick you into thinking that something will be executed (or won’t be executed) when it won’t (or when it will).

So NOT using curly braces may lead to hidden bugs and bad readability in general. In contrast, using them leaves no room for doubt about which line will run or not. It becomes easier to maintain, in my view.

Here are some examples to help illustrate what I mean:

    if (count > 10)
        System.out.println(1);
        System.out.println(2);
        System.out.println(3);

When you read the code above, you might think all three lines will be executed only if the condition is satisfied. But that’s not true. There are no curly braces; hence only the first line — printing one — depends on the if. Two and three will be printed in any case, even if, for example, the count is five.

Another example:

    int count = 15;
    if (count > 10)
        if (count > 20)
            return 1;
    else
        return 2;
    return 3;

The else is aligned with the first if condition, but it actually belongs to the if just before it. The program returns two, as the count is greater than ten but not greater than twenty.

In a code review, I would probably ask to change it (very politely and diplomatically, of course). The other team member may prefer without curly braces and depending on my position, that’s fine — I wouldn’t push it too much.

Checked or unchecked exception

Exceptions are events that happen outside of the normal flow. They allow programmers to separate the code that deals with the success path from the code that deals with errors.

Java has its own Exception classes, or the developer can create their own by extending Exception or RuntimeException.

Let’s say there is some particular validation error related to your business logic. You could create a class, for example, ProductNotFoundException, that extends the Exception class.
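A minimal sketch of what that could look like (findPrice and priceCatalog are hypothetical names used only for illustration):

    // A custom exception for a business validation error
    public class ProductNotFoundException extends Exception {
        public ProductNotFoundException(String message) {
            super(message);
        }
    }

    // Any caller of findPrice must either catch this exception or declare
    // "throws ProductNotFoundException" itself — the code won't compile otherwise
    public double findPrice(String productId) throws ProductNotFoundException {
        Double price = priceCatalog.get(productId); // priceCatalog: an illustrative Map<String, Double>
        if (price == null) {
            throw new ProductNotFoundException("No product found with id " + productId);
        }
        return price;
    }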

Another characteristic of how exceptions in Java work is that there are two types of exceptions: Checked and Unchecked.

Checked exceptions are exceptions that extend the Exception class (but not RuntimeException). Their behaviour is: if code inside method A throws a checked exception, any method that calls method A must handle it by either catching it or declaring that it throws it (or perhaps both). The code will not compile otherwise. Creating a checked exception is a way to force programmers to handle a specific error.

Unchecked exceptions are used for unrecoverable errors. Such errors are not meant to be handled. Instead, programmers should tackle the root cause that triggers them. Example: NullPointerException.

These exceptions extend RuntimeException and are different from checked ones in that the calling method is not forced to handle them by catching or declaring them.

Despite being used for unrecoverable errors, one could create an Unchecked exception. It’s just a matter of extending RuntimeException.
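As a sketch, the same ProductNotFoundException from the earlier example becomes unchecked simply by changing its parent class:

    // Unchecked variant: extends RuntimeException, so callers are no longer
    // forced by the compiler to catch it or declare it in a throws clause
    public class ProductNotFoundException extends RuntimeException {
        public ProductNotFoundException(String message) {
            super(message);
        }
    }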

Any method can handle such exceptions, but the compiler doesn’t complain if they don’t. That means that you can have the exception handling code only where it is needed.

I learned and used to code by always using Checked exceptions. You probably learned that way too. If you implement a method that calls another method A that generates a checked error, the compiler will tell you that you need to do something about it. And the intent of who created method A was exactly that, to alert and force others to handle the error.

Oracle does recommend always using a Checked exception if you expect to recover from the error. https://docs.oracle.com/javase/tutorial/essential/exceptions/runtime.html

Here’s the bottom line guideline: If a client can reasonably be expected to recover from an exception, make it a checked exception. If a client cannot do anything to recover from the exception, make it an unchecked exception.

I admit that despite Oracle’s recommendation, and it being good practice to use checked exceptions, I have extended RuntimeException before. I understand that the exceptions a method throws should be considered just as essential as its parameters and return value; they’re part of the method’s programming interface.

But no one creates a method that receives a parameter and does nothing with it. I find that throwing an exception and then re-throwing it upstream without doing any meaningful handling (logging, returning a message to the user) creates a bit of clutter. It’s unnecessary.

With unchecked exceptions, only the method that generates the error and the one that handles it need to deal with it. It’s a calculated risk I choose to take sometimes — calculated in the sense that an error that is supposed to be handled may not be: another developer who calls your method won’t be alerted by the compiler to handle the exception your method raises. That’s the drawback of making it unchecked, and it’s why doing so is generally considered a bad practice.

If I see that one of the team members chose to create and use an Unchecked exception, I would probably want to know the thought process and make sure they know the pros and cons.

Using an If then versus an Else exception

Another thing programmers tend to differ on is the use of an If then exception when an Else exception would also work.

I’ve experimented with both ways throughout my career as a developer. Today I prefer the “if-then exception” approach, where the validation throws right away (the second snippet below). I see it as clearer — it’s easier to read where errors are generated. And I usually keep it apart from the main logic in a private ‘validate’ method.
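Here is a small sketch of the two approaches (a hypothetical divide method, used only to illustrate the shape of the code):

    // "Else exception": the error case sits in an else branch after the main logic
    int divideElseException(int dividend, int divisor) {
        if (divisor != 0) {
            return dividend / divisor;
        } else {
            throw new IllegalArgumentException("divisor must not be zero");
        }
    }

    // "If-then exception": validate up front and fail fast, keeping the main logic flat
    int divideIfThenException(int dividend, int divisor) {
        if (divisor == 0) {
            throw new IllegalArgumentException("divisor must not be zero");
        }
        return dividend / divisor;
    }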

I would probably try to change it in a code review if one of my peers used the else-exception version. Unless the code is as simple as the example in this section — then I’d leave it (maybe).

Positioning the curly braces

This one is just cosmetics. It’s silly. It’s like preferring a toast cut diagonally versus horizontally or vertically (You’re probably wondering — How can someone not choose diagonally?! Anyway…).

I find it funny that, even in things like this, people have preferences.
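For instance, these two snippets differ only in where the opening brace sits:

    // Opening brace on the same line as the condition ("K&R" style)
    if (count > 10) {
        System.out.println("count is large");
    }

    // Opening brace on its own line ("Allman" style)
    if (count > 10)
    {
        System.out.println("count is large");
    }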

I prefer the first one, with the opening curly brace on the same line as the if condition or loop. I don’t see any considerable benefit to one or the other. I would not ask another programmer to change it.

Final thoughts

Certain coding style choices are personal, with no benefit or cons over others. It’s like preferring blue to red, orange to apple.

However, some other preferences are more arguable — do they make the code less readable or more error-prone? Out of the stylistic differences I covered, the omission of curly braces and the checked-versus-unchecked exception choice stand out. These are the ones with the most impact.

Even if there are coding standards set in place, it’s probably best if they aren’t too rigid. One still needs to allow the developer a certain amount of leeway to make their own personal mark. If it were up to me, I would set a rule to always use curly braces and, possibly, to use checked exceptions (because they tend to be safer), but that’s about it. In the end, it should be discussed and agreed upon as a team.

How fine-grained data placement helps optimize application performance

Ellen Friedman
October 22, 2021

 

Does data locality matter? In an ideal world, after all the work you put into developing an analytics or AI application, you would have unlimited access to resources to run the application to get top performance when it’s deployed in production. But the world is not perfect.

Access may be hampered by latency caused by distance, limitations on compute power, transmission media, or poorly optimized databases. What can you do about these issues, and how does fine-grained control of data locality help?

Getting the resources your applications need

Even though there are limitations on the availability of resources in any shared system, you don’t need resources at the same level at all points in the lifecycle of your data. For instance, with AI and machine learning projects, data latency and computational requirements change at various stages in the lifetime of models. The learning process, when models are trained, tends to be compute-intensive as compared to requirements when models run in production. Model training also requires high throughput, low-latency data access. It might seem ideal (if there were no limitations on resources) to run your entire machine learning project on high-performance computing (HPC) machines with specialized numerical accelerators such as graphical processing units (GPUs).

But while these are ideal for the model training phase, they may be less useful for the high-bulk, low-compute steps of raw data ingestion, data exploration, and initial processing for feature extraction. Instead, you may get the best net performance by carrying out these operations with data on traditional spinning media, especially if you live on a fixed budget (like everyone I know).

The point is that it’s not just real-world limitations on resources that drive the need to place data on different types of storage media. For top performance, you’d want to consider how the application processes the data. For certain, you’d want flexibility in any event.

You should be able to get the resources your applications need — when you need them and where you need them.

To make this work, the system on which your applications are deployed must allocate resources efficiently. The good news is that with a data infrastructure engineered to support scale-efficiency through granular data placement, it’s easy to optimize resource use and, in turn, to maximize application performance. Here’s how.

Match storage type to data requirements to maximize performance

The key to optimizing application performance and resource usage is to be able to match data throughput, latency and total size requirements with the appropriate type of storage media. Keep in mind that to get the full benefit of high-performance storage, it’s important to support GPUs and other accelerators from a data point of view.

In large systems, this optimization is accomplished by giving dedicated HPC machines high-performance storage, such as solid-state drives (SSDs) or NVMe drives, and provisioning regular machines with slower, spinning media (HDDs), capable of handling large amounts of data storage at a lower cost. This type of large-scale cluster is depicted in Figure 1.

 

Figure 1. Large cluster containing a combination of dedicated, fast-compute/fast storage nodes (orange) and regular nodes/slower storage devices (green)

 

In the figure above, the orange squares represent SSDs, and orange lines represent machines with computational accelerators (such as GPUs). Green cylinders stand for slower spinning storage media (HDDs) and servers with green lines indicate traditional CPUs. In a typical machine learning/AI scenario, raw data is ingested on the non-HPC machines, where data exploration and feature extraction would take place on very large amounts of raw data. In a scale-efficient system, bulk analytic workloads, such as monthly billing, would also take place on the non-HPC (green) machines.

Once feature extraction is complete, training data is written to fast storage machines (orange) with SSDs and GPUs, ready to support the model training process. Other compute-intensive applications, such as simulations, would also run on the fast machines.

Smaller systems (clusters with less than 20 machines) often cannot afford dedicated HPC machines with high-performance storage. Instead, the need for high-performance computing is met by employing some heterogeneous mix — nodes with fast-compute capabilities but with a mix of different kinds of data storage devices rather than just SSDs. This arrangement is shown in Figure 2.

Figure 2. Small cluster containing fast-compute nodes (orange) having a mixture of SSDs (orange squares) plus slower HDDs (green cylinders) and regular nodes with HDDs only.

 

Similar to the earlier example, you need a way to assign what data will be placed on which machines. Fortunately, HPE Ezmeral Data Fabric lets you use storage labels to do just that.

Fine-grained data locality with HPE Ezmeral Data Fabric

HPE Ezmeral Data Fabric is a highly scalable, unifying data infrastructure engineered for data storage, management, and motion. Data fabric is software-defined and hardware agnostic. It lets you conveniently position data at the level of different racks, machines, or even different storage types within machines.

Figure 3 below shows how easy it is to create a data fabric volume, assign topology, and apply data placement policies via storage labels. (A data fabric volume is a data management unit that holds files, directories, NoSQL tables, and event streams together; volumes act like directories with superpowers for data management. Many policies, including data placement, are assigned at the volume level.)

 

Figure 3. Screenshot of the control plane for HPE Ezmeral Data Fabric.

 

What happens if you need cross-cutting requirements for data placement? Data fabric lets you define data locality to address multiple goals, such as placement across multiple racks within topologies designated for different failure domains plus additional requirements for particular storage media imposed by assigning storage labels. Locality of the data fabric volume would have to meet both requirements.

Figure 4 illustrates an example of fine-grained data placement accomplished using storage labels in a cluster with heterogeneous machines.

Figure 4. Using the storage labels feature of HPE Ezmeral Data Fabric for differential data placement on particular types of storage devices at the sub-machine level.

Benefits of high-performance metadata with HPE Ezmeral Data Fabric

Performance in distributed systems running many different applications is further enhanced by fine-grained data placement using HPE Ezmeral Data Fabric storage labels. This capability lets you easily assign data locality down to the level of storage pools, a unit of storage within a machine made up of multiple disks. To understand how this additional performance boost works, you’ll need a little background information about the data fabric and to understand how metadata is handled.

HPE Ezmeral Data Fabric uses a large unit of data storage, known as a data fabric container (not to be confused with a Kubernetes container, despite the similarity in the name) as the unit of replication. Data replication is an automatic feature of the data fabric – the basis for data fabric’s self-healing capabilities – with data replicas spread across multiple machines by default. But you can also specify particular data placement policies, and data fabric containers and their replicas will automatically be placed according to the policies you apply.

Data fabric also has a special container, known as a name container, which holds metadata for the files, directories, tables, and event streams associated with a data fabric volume. The name container is a strength of the HPE Ezmeral Data Fabric design because it provides a way for metadata to be distributed across a cluster, resulting in extreme reliability and high performance.

With the fine granularity for data placement afforded by the storage labels feature, data fabric containers and their replicas can have one placement policy while the name container can have a different policy. As Figure 4 shows, you can apply a label “Warm” to position data for bulk workloads on storage pools with slower devices while maintaining the metadata for that volume on fast solid-state devices by applying the label “Hot” to the name container.

This situation can result in significant throughput improvements in processes such as massive disk-based sorts, where a very large number of spill files must be created quickly (requiring super-fast metadata updates on SSDs) and then written very quickly (requiring the fast sequential I/O that hordes of hard drives can provide). The combination can work better than either option in isolation by providing the right resources for the right micro-workloads.

Making the most of fine-grained data placement

Turns out you don’t need unlimited resources to get excellent performance for your applications when you take advantage of the fine granularity of data placement afforded by HPE Ezmeral Data Fabric. You can easily assign data topologies when you create a data volume, and you can use convenient storage labels for differential data placement on particular types of storage devices even down to different storage pools within machines. And with the added capability of placing metadata independently of data containers, you can further optimize performance for both bulk applications and in situations using many small files.

To find out more about the capabilities provided by HPE Ezmeral Data Fabric visit the data fabric platform page in the HPE Developer Community.

For a hands-on workshop highlighting data fabric volumes, go to HPE Ezmeral Data Fabric 101 – Get to know the basics around the data fabric.

To learn about data access management using HPE Ezmeral Data Fabric, read the New Stack article Data Access Control via ACEs vs ACLs: The power of “AND” and “NOT”.
