Data Storage, Management, and Sharing on Chameleon

Chameleon offers multiple ways to store and share your data, so understanding how the storage types differ, and what your experiment needs, will help you choose the best storage solution. Chameleon storage capabilities are provided via an enhanced version of mainstream OpenStack and include both ephemeral storage and persistent storage (Object Storage, Block Storage, and File-based storage). In this blog, we discuss the storage types provided by Chameleon and their use cases. Don’t forget to check out our Jupyter Notebook tutorials, YouTube video, and various learning materials mentioned at the end of the blog!
 

Ephemeral Storage 

You can save your data to your instance’s primary block device. Chameleon has various types of storage nodes, such as nodes with NVMe SSDs and storage hierarchy nodes. You can also deploy your own standalone NFS server and clients via our NFS Share appliance. However, data written to your instance is not persisted, i.e., it disappears when your instance is terminated. To preserve it, Chameleon offers the cc-snapshot tool for bare metal instances, which allows you to snapshot your disk image. If you use the KVM@TACC site, you can also choose to base your disk image on top of a mountable block device (OpenStack Cinder). Be cautious when persisting and sharing your data this way, as we expect bootable images to contain only the minimum amount of configuration needed to start your experiment quickly. We encourage you to save dependencies/software this way. However, saving large datasets in your snapshotted disk image is not recommended, as large images take longer to launch or fail to launch altogether. For large or important datasets, you should consider using persistent storage.
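As a rough illustration of the advice above, the following Python sketch sums the size of the files under a directory so you can spot large datasets before taking a snapshot. The function names and the 1 GiB default limit are hypothetical choices made for this example; they are not part of cc-snapshot or any Chameleon tool.

```python
import os


def directory_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked data is not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def too_large_for_snapshot(root, limit_bytes=1024 ** 3):
    """Return True if root holds more data than the limit.

    The 1 GiB default is an arbitrary, hypothetical threshold.
    """
    return directory_size_bytes(root) > limit_bytes
```

A directory that exceeds whatever limit you pick is a good candidate for moving to persistent storage instead of the disk image.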

Object Storage

Object Storage stores data as binary blobs and allows you to access and manage your binary objects via a REST API. This is the only storage type that lets you access your data from anywhere without an instance; all other storage types require an active instance. An “object” is not a “file”, but multiple chunks of data. Object storage is suitable for storing large objects at scale and provides an easy way to share your data with others. Objects also contain metadata. Fixed-key metadata includes checksum, security levels, and ACLs. You can also set custom metadata to help you search, process, and use your objects. Chameleon provides the cc-cloudfuse tool to help you mount your Object Store to your instance file system, but there are limitations, as you are not dealing with real files or a real file system. You can read our user documentation for a list of limitations on mounting and accessing your Object Store via the cc-cloudfuse tool. Object storage isn’t suitable for transactional data, as objects are immutable and updated in their entirety, i.e., you cannot modify a single portion of an object. As a result, it performs more slowly than other cloud storage types. Commercial cloud equivalents of object storage include AWS S3, GCP Cloud Storage, and Azure Blob Storage.
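To make the REST access and custom metadata described above more concrete, here is a minimal Python sketch that builds, but does not send, an upload request in the style of the OpenStack Swift object API. It assumes a Swift-compatible endpoint; the storage URL, token, container, and object names are hypothetical placeholders, and `build_object_put` is a helper written for this example, not a Chameleon function.

```python
import urllib.parse
import urllib.request


def build_object_put(storage_url, token, container, name, data, metadata=None):
    """Build (without sending) a Swift-style PUT request for one object.

    storage_url is assumed to look like https://host/v1/AUTH_project.
    """
    # Object path: <storage_url>/<container>/<object>; "/" in the object
    # name is kept so that pseudo-directories still work.
    url = "{}/{}/{}".format(
        storage_url.rstrip("/"),
        urllib.parse.quote(container),
        urllib.parse.quote(name),
    )
    headers = {"X-Auth-Token": token}
    # Custom metadata travels as X-Object-Meta-<key> headers.
    for key, value in (metadata or {}).items():
        headers["X-Object-Meta-" + key] = value
    # The whole object body is sent at once; there is no partial update.
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


# Hypothetical values, for illustration only.
request = build_object_put(
    "https://swift.example.org/v1/AUTH_myproject",
    "example-token",
    "experiment-data",
    "results/run1.csv",
    b"a,b\n1,2\n",
    metadata={"Experiment": "run1"},
)
```

Passing the request to `urllib.request.urlopen` would perform the upload against a real endpoint. Because the entire body is replaced on every PUT, changing one row of the CSV means uploading the whole object again.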

Shared File System

To fill the gap in mountable storage capability between KVM and bare metal setups, we released the shared file system through the OpenStack Manila interface for CHI@UC and CHI@TACC in July 2022. A “share” is a pre-allocated, remote, and mountable file system backed by our Ceph clusters. The shared file system is persistent and, just like OpenStack Cinder volumes, allows you to detach from one instance and attach to another without data loss. Even better than the block storage service, you can mount a share to any number of instances, and several users can access it at the same time, thanks to its locking capability. A share acts exactly like your instance’s local file system: data is saved as files with file extensions, and the files are organized in folders and subfolders. Your share is protected by reservable storage networks and access rules, so that only permitted IPs (instances) can access the share. You can easily mount your share to your local file system via the NFS protocol.
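As a sketch of what mounting a share over NFS can look like, the Python helper below assembles a standard `mount -t nfs` command. The export location, the mount point, and the NFS options (`nfsvers=4.2,proto=tcp`) are assumptions made for this example; use the export path and options reported for your own share. `nfs_mount_command` is a hypothetical helper, not part of python-chi.

```python
import shlex


def nfs_mount_command(export_location, mount_point="/mnt",
                      options="nfsvers=4.2,proto=tcp"):
    """Return the argument list for mounting an NFS export.

    export_location has the form <server-ip>:<exported-path>.
    """
    return ["sudo", "mount", "-t", "nfs", "-o", options,
            export_location, mount_point]


# Hypothetical export location, for illustration only.
command = nfs_mount_command("10.0.0.5:/volumes/example-share")
# Render the list as a shell-safe string for display or logging.
command_line = shlex.join(command)
```

On an instance attached to the storage network, the list can be passed to `subprocess.run(command, check=True)`. The same command works on every instance that is allowed to access the share.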

You can save not only your datasets but also dependencies/software in your share! This helps you slim down your disk image so that your instance launches faster. The shared file system also provides an efficient way to collaborate with your project members on files, but please take care not to overwrite each other’s changes, for example by using version control software. Commercial cloud equivalents of file-based storage include AWS EFS, GCP Filestore, and Azure Files.

Learning Materials

Our python-chi library helps you interact with all Chameleon storage types listed above. We also have a Jupyter Notebook tutorial for Data Management, which walks you through various ways to store, manage, and share your data. In the tutorial, you can learn about the cc-snapshot and cc-cloudfuse tools and how to interact with the Chameleon Object Store. In addition, we created a separate Jupyter Notebook for our shared file system. That tutorial explores how to create a share and access it over a reserved storage network using the python-chi library. Don’t forget to check out the recording of our Science, Storage, and Sanity Webinar (Summer with Chameleon 2022) on the Chameleon YouTube channel. Other learning resources are listed below:

Chameleon Changelog for July 2022

This month, we are excited to announce integration with the Fabric testbed! We also have updates to the filesystem, including support for it at CHI@TACC, new nodes at CHI@UC with A100 GPUs and IceLake CPUs, and lots of usability improvements to the testbed.
 

Conducting Storage Research on Chameleon

Chameleon has added lots of new and exciting storage capabilities in recent months. Learn all about the resources available, how to conduct storage research on Chameleon, and examples of storage experiments being conducted by researchers at the University of Chicago, Carnegie Mellon University, Virginia Tech, and Rutgers University. There's also a fully packaged experiment, with an accompanying YouTube video, that you can run on Trovi to practice with!

Where Do I Put My Data in Chameleon?

Have you ever lost your data after your instance failed, or are your instances failing to launch with a custom image? You may be handling your data incorrectly in the cloud. Read on to learn how to keep your data persistent and your custom images small.

Transferring Large Data Flows on Chameleon

A ready-to-use Data Transfer Node (DTN) is provided for efficient network data transfer over a long fat network. In addition, a Chameleon Complex Appliance is published for easily spawning a set of DTNs in Chameleon Cloud.

