Exploring Distributed MySQL Clusters with Ceph

This article was translated by QiYu of the Ceph China Community and proofread by BanTianHe.

Original source: Using Ceph with MySQL

Percona has started experimenting with distributed MySQL clusters built on top of Ceph, using the snapshot, backup, and HA features Ceph provides to solve the storage problems underlying distributed databases.

Over the past year, the Ceph world has drawn me in, partly because of my taste for distributed systems, but also because I believe Ceph represents a great opportunity for MySQL specifically and for databases in general. It is a transition similar to the move from local to distributed storage, and from bare-disk host configurations to LVM-managed disks.

Most of the work I have done with Ceph was in collaboration with Red Hat partners (mainly Brent Compton and Kyle Bader). That work sparked a few talks at the Percona online conference in April and at the Red Hat Summit in San Francisco at the end of June. I could write a lot about the experience of running databases on Ceph, and I hope this post is the first in a long series on the subject. Before getting into use cases, configuration, and performance benchmarks, I should quickly review the architecture and principles behind Ceph.

Introduction to Ceph

Ceph was created a few years ago at Inktank, an independent subsidiary of the hosting company DreamHost. Red Hat acquired Inktank in 2014 and now offers Ceph as a storage solution. OpenStack commonly uses Ceph as its primary storage backend; this post, however, takes a more general perspective and is not limited to virtual environments.

A simplified way to describe Ceph is to say it is an object store, like S3 or Swift. That is correct, but it only captures part of the picture. Ceph has at least two types of nodes: monitors (MON) and object storage daemons (OSD). Monitors maintain the cluster map, or, if you prefer, the cluster metadata. Without the access information held by the monitor nodes, the cluster is useless to you, so redundancy and quorum at the monitor level are very important.

Any worthwhile Ceph deployment needs at least three monitors. Monitors are lightweight processes that can run alongside OSD nodes (the other node type required for a minimum configuration). OSD nodes store data on local disks; a single physical server can host many OSDs, but running more than one monitor on a single server is pointless. OSD nodes are arranged in a hierarchy in the cluster metadata (the CRUSH map), which can span data centers, racks, servers, and so on. OSDs can also be organized by disk type so that, for example, some objects are stored on SSDs while others live on mechanical disks.

Using the CRUSH map provided by the monitors, any client can locate data with a deterministic, pseudo-random placement algorithm. No forwarding proxy is needed; such proxies introduce performance bottlenecks and quickly limit scalability. This smart architecture is somewhat similar to the NDB API, where clients access data directly on the data nodes given a cluster map maintained by the NDB management nodes.
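
To make the no-proxy access model concrete, here is a minimal sketch using the librados Python bindings (the python3-rados package); the pool name mysql-data is a placeholder. The client pulls the cluster map from the monitors once at connect time, then computes object placement itself and talks to the OSDs directly.

```python
import rados

# Connecting reads the monitor addresses from ceph.conf; once the
# handshake completes, the client holds the cluster map and computes
# object placement itself with CRUSH -- no forwarding proxy involved.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('mysql-data')      # hypothetical pool name
ioctx.write_full('greeting', b'hello ceph')   # write an object
print(ioctx.read('greeting'))                 # read it back from the OSDs

ioctx.close()
cluster.shutdown()
```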

Ceph stores data in logical containers called pools. Defining a pool creates a number of PGs (placement groups), which are the shards of data within the pool. For example, on a four-node Ceph cluster (one OSD per node), a pool defined with 256 PGs puts 64 PGs of that pool on each OSD. Think of PGs as a level of indirection that spreads data uniformly across the nodes. At the pool level, you define the number of replicas (Ceph's term is 'size').
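
As an illustration (not from the original article), a pool like the one above can also be created programmatically through mon_command in the same Python bindings, the equivalent of the ceph osd pool create and ceph osd pool set CLI commands; the pool name and numbers are placeholders.

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Create a pool with 256 placement groups (illustrative values).
cluster.mon_command(json.dumps({
    'prefix': 'osd pool create', 'pool': 'mysql-data', 'pg_num': 256}), b'')

# Set the replica count ('size') of the pool to 3.
cluster.mon_command(json.dumps({
    'prefix': 'osd pool set', 'pool': 'mysql-data',
    'var': 'size', 'val': '3'}), b'')

cluster.shutdown()
```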

The recommended replica count is 3 for ordinary mechanical disks and 2 for SSD/flash. For throwaway test VM images, I often use a single replica. A replica count greater than 1 means each PG is paired with one or more PGs on other OSD nodes. When data is modified, it is synchronously copied to the paired PGs so the data stays available if an OSD node fails.

So far, I have covered the basics of object storage. What makes Ceph stand out from other object stores (in my view) is its ability to update objects in place. The underlying object access protocol, RADOS, lets you update any byte range of an object, just as you would with a regular file. That update capability is what allows the object store to support more advanced applications: block devices (RBD) and even the CephFS network file system.
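
You can see this in-place update capability directly in the librados API. A minimal sketch, again with the Python bindings and an illustrative pool and object name:

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mysql-data')  # hypothetical pool

# Write a full object, then overwrite four bytes in the middle of it,
# much like pwrite() on a regular file.
ioctx.write_full('demo-object', b'0123456789')
ioctx.write('demo-object', b'XXXX', offset=3)
print(ioctx.read('demo-object'))  # b'012XXXX789'

ioctx.close()
cluster.shutdown()
```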

When running MySQL on Ceph, the block-device behavior of RBD is very appealing. A Ceph RBD disk is essentially a chain of objects (4MB each by default) that the Linux kernel's RBD module exposes as a block device. Functionally it is quite similar to an iSCSI device: it can be mounted on any node with access to the storage network, and its performance depends on that network.
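
Because an RBD image is a chain of fixed-size objects, mapping a byte offset in the block device to its backing object is simple arithmetic. A small sketch, assuming the default 4MB object size:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default RBD object size: 4MB

def backing_object(image_offset: int) -> tuple[int, int]:
    """Map a byte offset in the RBD image to (object index, offset
    within that object)."""
    return image_offset // OBJECT_SIZE, image_offset % OBJECT_SIZE

# A write at byte 10,000,000 of the image lands 1,611,392 bytes into
# the third backing object (index 2).
print(backing_object(10_000_000))  # (2, 1611392)
```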

Advantages of Using Ceph

Convenience

In a world moving toward virtualization and containers, Ceph makes it convenient to migrate database resources between hosts.

IO Scalability

On a single host, you can only use the IO capacity of that host. With Ceph, you effectively aggregate the IO capacity of all the hosts: if each host delivers 1000 IOPS, a four-node cluster can reach 4000 IOPS.

High Availability

Ceph replicates data at the storage level and keeps it available when a storage node fails, somewhat like a distributed DRBD.

Backup

Ceph RBD block devices support snapshots, which are fast and have negligible performance impact. Snapshots are an ideal way to perform MySQL backups.
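
A minimal sketch of taking such a snapshot with the rbd Python bindings (python3-rbd); the pool, image, and snapshot names are placeholders. For a consistent MySQL backup you would normally quiesce writes first (for example, FLUSH TABLES WITH READ LOCK plus a filesystem freeze).

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mysql-data')  # hypothetical pool

# Snapshot the RBD image backing the MySQL datadir; quiesce MySQL and
# freeze the filesystem beforehand so the snapshot is consistent.
with rbd.Image(ioctx, 'mysql-vol') as image:
    image.create_snap('backup-20160701')

ioctx.close()
cluster.shutdown()
```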

Thin Provisioning

You can clone snapshots and mount the clones as block devices. This is very useful for provisioning new database servers for replication, whether with asynchronous replication or with Galera, as sketched below.
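
Here is a sketch of the protect-and-clone flow with the same bindings, assuming the image is format 2 with the layering feature enabled (names are again placeholders). The clone is thin: it shares unmodified blocks with its parent, so a new replica starts without copying the full dataset.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mysql-data')  # hypothetical pool

# A snapshot must be protected before it can be cloned.
with rbd.Image(ioctx, 'mysql-vol') as image:
    image.protect_snap('backup-20160701')

# Create a copy-on-write clone to back a new replica's data volume.
rbd.RBD().clone(ioctx, 'mysql-vol', 'backup-20160701',
                ioctx, 'mysql-replica-1')

ioctx.close()
cluster.shutdown()
```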

Advice for Using Ceph

Of course, nothing is free. To use Ceph well, you need to heed some advice.

Response to OSD Loss in Ceph

If an OSD fails, the Ceph cluster starts re-replicating the data that now has fewer copies than the configured size. While this supports high availability, the re-replication process noticeably affects performance. It also means you cannot run Ceph on nearly full storage: you must keep enough free disk space to absorb the loss of a node.

The OSD noout attribute mitigates this by preventing Ceph from reacting to a failure automatically (though you then have to handle the failure yourself). With noout set, you must monitor for the cluster running in degraded mode and take action, much as you would for a failed disk in a RAID set. You can set mon_osd_auto_mark_auto_out_in to make this behavior the default.
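
Setting the flag is the equivalent of ceph osd set noout on the CLI. A hedged sketch using mon_command from the Python bindings:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Equivalent of 'ceph osd set noout': failed OSDs stay 'in', so no
# automatic re-replication starts -- you must watch for degraded PGs
# and react yourself. 'osd unset' restores the default behavior.
cluster.mon_command(json.dumps({'prefix': 'osd set', 'key': 'noout'}), b'')

cluster.shutdown()
```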

Data Scrubbing

Ceph scrubs data daily and performs a deep scrub weekly. Although scrubbing is throttled, it still affects performance. You can modify the intervals and the hours that control scrub actions. Once a day and once a week is probably fine, but you should set osd_scrub_begin_hour and osd_scrub_end_hour to restrict when scrubbing runs, and keep scrubbing from putting excessive load on the nodes: the osd_scrub_load_threshold variable sets that threshold.
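
On recent Ceph releases (Mimic and later) these options can be changed at runtime through the centralized configuration, the equivalent of ceph config set osd ...; older clusters set the same keys in ceph.conf. A sketch with illustrative values:

```python
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Restrict scrubbing to a quiet window (here 22:00-07:00, illustrative)
# and skip scrubs whenever the node's load average exceeds the threshold.
for name, value in (('osd_scrub_begin_hour', '22'),
                    ('osd_scrub_end_hour', '7'),
                    ('osd_scrub_load_threshold', '0.5')):
    cluster.mon_command(json.dumps({
        'prefix': 'config set', 'who': 'osd',
        'name': name, 'value': value}), b'')

cluster.shutdown()
```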

Tuning

Ceph has many parameters, so tuning it can be complex and confusing. Since distributed systems push the hardware hard, tuning Ceph properly may require spreading the interrupt load across all cores, binding threads to CPU cores, and paying attention to NUMA domains, especially if you use NVMe devices.

Conclusion

I hope this post serves as a good introduction to Ceph. I have covered its architecture, its benefits, and some advice for using it. In future posts, I will present use cases for MySQL on Ceph, including tuning XtraDB Cluster SST operations with Ceph snapshots, deploying asynchronous slaves, and building HA configurations. I also hope to provide guidance on how to build and configure an efficient Ceph cluster.

Finally, a word for those who think a Ceph cluster is too costly and complex to build: the picture below shows my home cluster, which I use heavily. It consists of four ARM-based nodes (Odroid-XU4), each with a 2TB USB 3.0 hard drive, a 16GB eMMC flash drive, and a 1Gb Ethernet port.

I won't claim it breaks performance records (although it is good enough), but from a cost perspective (around $600) it is hard to beat.
