These days, when data needs to be instantly accessible, stored forever and available through a variety of devices, the demands on storage systems are changing rapidly. No longer is it good enough to build storage silos utilizing non-web protocols that are tied to specific applications. Social media, online video, user-uploaded content, gaming, and software-as-a-service applications are just some of the forces that are driving this change. To date, public cloud storage services have risen to meet these new storage needs, but not every organization can - or should - use public cloud storage.
To accommodate these changing needs, storage systems must be able to handle web-scale workloads with many concurrent readers and writers to a data store. Some data is frequently written and retrieved, such as database files and virtual machine images. Other data, such as documents, images, and backups, is generally written once and rarely accessed. Web and mobile data assets also need to be accessible over the web via a URL to support today's web/mobile applications. A one-size-fits-all data storage solution is therefore neither practical nor economical.
Swift is a multi-tenant, highly scalable and durable object storage system that was designed to store large amounts of unstructured data at low cost via a RESTful HTTP API. "Highly scalable" means that it can scale from a few nodes and a handful of drives to thousands of machines and dozens of petabytes of storage. Swift is designed to be horizontally scalable; there is no single point of failure. Swift is also ideal for storing and serving content to many concurrent users, a characteristic that differentiates it from other storage systems.
As one of the two initial components of the OpenStack project, Swift is used to meet a variety of needs. Swift’s usage ranges from small deployments for “just” storing VM images, to mission critical storage clusters for high-volume websites, to mobile application development, custom file-sharing applications, data analytics and private storage infrastructure-as-a-service. Swift is open-sourced under the Apache 2 license and now has over 70 contributors, and new developers are contributing every year.
What differentiates Swift from many other storage systems is that it originated in a large-scale production environment, which means that it was designed to withstand hardware failures without any downtime and provide operations teams the means to maintain, upgrade and enhance a cluster while in flight. Swift also scales linearly so operations can add storage capacity when it is needed without worrying about performance overhead costs. By focusing on being a great web-based storage system, Swift can also optimize for that use case. Trying to unify all storage needs into one system increases complexity and reduces stability.
The purpose of this architecture overview is to help those who are considering deploying an object storage system based on OpenStack Swift. It complements the official Swift documentation, which is available at http://swift.openstack.org. Not every topic related to getting Swift up and running in your environment is covered in this document, but it provides an overview of the key building blocks of Swift, how Swift works and some general deployment considerations.
The key characteristics of Swift include:
Developers can either write directly to the Swift API or use one of the many client libraries that exist for all popular programming languages, such as Java, Python, Ruby and C#. Amazon S3 and Rackspace Cloud Files users should feel very familiar with Swift. For users who have not used an object storage system before, it will require a different approach and mindset than using a traditional filesystem.
All communication with Swift is done over a RESTful HTTP API. Application developers who'd like to take advantage of Swift for storing content, documents, files, images etc. can use one of the many client libraries that exist for all popular programming languages, including Java, Python, Ruby, C# and PHP. Existing backup, data protection and archiving applications that currently support either Rackspace Cloud Files or Amazon S3 can also use Swift as their storage back-end with minor modifications.
As Swift has a REST-ful API, all communication with Swift is done over HTTP, using the HTTP verbs to signal the requested action. A Swift storage URL looks like this:
Swift’s URLs have four basic parts. Using the example above, these parts are:
/, so pseudo-nested directories are possible.
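As a rough illustration, the sketch below splits a storage URL of the form `https://swift.example.com/v1/account/container/object` (the example URL used elsewhere in this document) into a base, an account, a container and an object name. The helper name `split_storage_url` is hypothetical and is not part of any Swift client library; the split into these four parts is an assumption based on that URL form.

```python
from urllib.parse import urlsplit


def split_storage_url(url):
    """Split a Swift-style storage URL into base, account, container, object.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # The path looks like /v1/account/container/object. The object name may
    # itself contain "/" characters, so split at most three times.
    version, account, container, obj = parts.path.lstrip("/").split("/", 3)
    base = f"{parts.scheme}://{parts.netloc}/{version}"
    return base, account, container, obj


print(split_storage_url("https://swift.example.com/v1/account/container/photos/2012/cat.jpg"))
```

Because only the first three separators are significant, an object name such as `photos/2012/cat.jpg` stays intact, which is what makes pseudo-nested directories possible.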
To get a list of all containers in an account, use the GET command on the account:
To create new containers, use the PUT command with the name of the new container:
To list all objects in a container, use the GET command on the container:
To create new objects, use the PUT command on the object:
The POST command is used to change metadata on containers and objects.
When planning a Swift deployment, the first step is to define the application workloads and functional requirements that will determine how your Swift
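The request patterns described above (GET on an account or container, PUT for a new container or object, POST for metadata) can be sketched with Python's standard library. This is a minimal sketch, not a definitive client: the storage URL, the token value, and the container and object names are placeholders, and the requests are only built, not sent.

```python
import urllib.request

# Assumed values for illustration only; a real cluster supplies its own
# storage URL and auth token after authentication.
STORAGE_URL = "https://swift.example.com/v1/account"
TOKEN = "example-token"


def build_request(method, path="", headers=None, data=None):
    """Build (but do not send) an HTTP request against the storage URL."""
    all_headers = {"X-Auth-Token": TOKEN}
    all_headers.update(headers or {})
    return urllib.request.Request(
        STORAGE_URL + path, data=data, headers=all_headers, method=method
    )


list_containers = build_request("GET")               # GET on the account
create_container = build_request("PUT", "/photos")   # PUT a new container
list_objects = build_request("GET", "/photos")       # GET on the container
upload_object = build_request("PUT", "/photos/cat.jpg", data=b"image bytes")
set_metadata = build_request(                        # POST changes metadata
    "POST", "/photos/cat.jpg", headers={"X-Object-Meta-Owner": "alice"}
)

for req in (list_containers, create_container, list_objects, upload_object, set_metadata):
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it; that step is left out here because it needs a reachable cluster and a valid token.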
Several client libraries for Swift are available, including:
There are several tools that are compatible with Swift, including storage gateways, file managers, backup tools and filesystem adapters. Here is a list of some of the tools that are compatible with Swift:
The components that enable Swift to deliver high availability, high durability and high concurrency are:
The Proxy Servers are the public face of Swift and handle all incoming API requests. Once a Proxy Server receives a request, it determines the storage node based on the URL of the object, e.g. https://swift.example.com/v1/account/container/object. The Proxy Servers also coordinate responses, handle failures and coordinate timestamps.
Proxy servers use a shared-nothing architecture and can be scaled as needed based on projected workloads. A minimum of two Proxy Servers should be deployed for redundancy. Should one proxy server fail, the others will take over.
The Ring maps Partitions to physical locations on disk. When other components need to perform any operation on an object, container, or account, they need to interact with the Ring to determine its location in the cluster.
The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the Ring is replicated three times by default across the cluster, and the locations for a partition are stored in the mapping maintained by the Ring. The Ring is also responsible for determining which devices are used for handoff should a failure occur.
The Ring maps partitions to physical locations on disk.
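The mapping can be pictured with a simplified sketch: hash the object's path, keep the top bits of the hash as the partition number, then look that partition up in a table of devices, one row per replica. The device list, the partition count and the striding assignment below are assumptions made for illustration; a production Ring is built by Swift's own ring tooling and is considerably more involved.

```python
import hashlib
import struct

PART_POWER = 4   # assumed: 2**4 = 16 partitions (real clusters use far more)
REPLICAS = 3
DEVICES = [      # hypothetical devices spread over five zones
    {"id": i, "zone": i % 5, "ip": f"10.0.0.{i + 1}", "device": "sdb"}
    for i in range(10)
]

# replica2part2dev[r][p] is the id of the device holding replica r of
# partition p. This toy assignment strides through the device list so the
# three replicas of a partition land in three different zones.
replica2part2dev = [
    [(p + r) % len(DEVICES) for p in range(2 ** PART_POWER)]
    for r in range(REPLICAS)
]


def get_partition(account, container=None, obj=None):
    """Map a name to a partition using the top bits of its MD5 hash."""
    path = "/" + "/".join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return struct.unpack_from(">I", digest)[0] >> (32 - PART_POWER)


def get_nodes(account, container=None, obj=None):
    """Return the partition and the devices that hold its replicas."""
    part = get_partition(account, container, obj)
    return part, [DEVICES[replica2part2dev[r][part]] for r in range(REPLICAS)]


part, nodes = get_nodes("account", "container", "object")
print(part, [(n["zone"], n["ip"]) for n in nodes])
```

Because the partition is derived only from the name, any component that holds a copy of the table can find an object's location without consulting a central index.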
Swift allows zones to be configured to isolate failure boundaries. Each replica of the data resides in a separate zone, if possible. At the smallest level, a zone could be a single drive or a grouping of a few drives. If there were five object storage servers, then each server would represent its own zone. Larger deployments would have an entire rack (or multiple racks) of object servers, each representing a zone. The goal of zones is to allow the cluster to tolerate significant outages of storage servers without losing all replicas of the data.
As we learned earlier, everything in Swift is stored, by default, three times. Swift will place each replica "as-uniquely-as-possible" to ensure both high availability and high durability. This means that when choosing a replica location, Swift will choose a server in an unused zone before an unused server in a zone that already has a replica of the data.
When a disk fails, replica data is automatically distributed to the other zones to ensure there are three copies of the data.
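The "as-uniquely-as-possible" rule can be sketched as a small selection loop: prefer a server in a zone that holds no replica yet, and only then fall back to an unused server in an already-used zone. The function name and the `(zone, server)` tuples are hypothetical, and the sketch ignores drive weights and capacity, which a real placement algorithm would also consider.

```python
def place_replicas(servers, replicas=3):
    """Pick servers for replicas, preferring unused zones, then unused servers.

    `servers` is a list of (zone, server_name) tuples.
    """
    chosen = []
    for _ in range(replicas):
        used_zones = {zone for zone, name in chosen}
        candidates = [s for s in servers if s not in chosen]
        if not candidates:
            break  # fewer servers than replicas
        # False sorts before True, so servers in unused zones come first.
        # The sort is stable, so the original order breaks ties.
        candidates.sort(key=lambda s: s[0] in used_zones)
        chosen.append(candidates[0])
    return chosen


# Three zones available: each replica lands in its own zone.
print(place_replicas([(1, "a"), (1, "b"), (2, "c"), (3, "d")]))
# Only two zones available: the third replica reuses zone 1 on another server.
print(place_replicas([(1, "a"), (1, "b"), (2, "c")]))
```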
Each account and container is an individual SQLite database that is distributed across the cluster. An account database contains the list of containers in that account. A container database contains the list of objects in that container.
To keep track of object data location, each account in the system has a database that references all its containers, and each container database references each object.
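The relationship between the two kinds of database can be pictured with a short sketch that uses in-memory SQLite databases. The table names, column names and sample rows are illustrative assumptions and should not be read as Swift's actual schema.

```python
import sqlite3

# In-memory databases stand in for the per-account and per-container
# SQLite files; table and column names are illustrative only.
account_db = sqlite3.connect(":memory:")
account_db.execute("CREATE TABLE container (name TEXT PRIMARY KEY)")
account_db.execute("INSERT INTO container VALUES ('photos')")

container_db = sqlite3.connect(":memory:")
container_db.execute("CREATE TABLE object (name TEXT PRIMARY KEY, size INTEGER)")
container_db.executemany(
    "INSERT INTO object VALUES (?, ?)",
    [("cat.jpg", 48213), ("dog.jpg", 51877)],
)

# Listing an account reads the account database; listing a container
# reads that container's database.
containers = [row[0] for row in account_db.execute("SELECT name FROM container ORDER BY name")]
objects = [row[0] for row in container_db.execute("SELECT name FROM object ORDER BY name")]
print(containers, objects)
```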
A Partition is a collection of stored data, including Account databases, Container databases, and objects. Partitions are core to the replication system.
Think of a Partition as a bin moving throughout a fulfillment center warehouse. Individual orders get thrown into the bin. The system treats that bin as a cohesive entity as it moves throughout the system. A bin full of things is easier to deal with than lots of little things. It makes for fewer moving parts throughout the system.
The system replicators and object uploads/downloads operate on Partitions. As the system scales up, behavior continues to be predictable as the number of Partitions is a fixed number.
The implementation of a Partition is conceptually simple -- a partition is just a directory sitting on a disk with a corresponding hash table of what it contains.
Swift partitions contain all data in the system.
In order to ensure that there are three copies of the data everywhere, replicators continuously examine each Partition. For each local Partition, the replicator compares it against the replicated copies in the other Zones to see if there are any differences.
How does the replicator know if replication needs to take place? It does this by examining hashes. A hash file is created for each Partition, which contains hashes of each directory in the Partition. For a given Partition, the hash files of its three copies are compared. If the hashes are different, then it is time to replicate, and the directory that needs to be replicated is copied over.
This is where the Partitions come in handy. With fewer "things" in the system, larger chunks of data are transferred around (rather than lots of little TCP connections, which is inefficient) and there is a consistent number of hashes to compare.
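The comparison step can be sketched as follows. Plain dictionaries stand in for the on-disk directories of a Partition, and the function names are hypothetical; the point is only that comparing one hash per directory identifies what needs to be copied without examining every file.

```python
import hashlib


def hash_partition(partition):
    """Return {directory: md5 hex digest} for a partition.

    `partition` maps a directory name to the list of file names it holds.
    """
    return {
        directory: hashlib.md5("".join(sorted(files)).encode("utf-8")).hexdigest()
        for directory, files in partition.items()
    }


def directories_to_sync(local, remote):
    """Directories whose hash differs from, or is missing on, the remote copy."""
    local_hashes = hash_partition(local)
    remote_hashes = hash_partition(remote)
    return sorted(d for d, h in local_hashes.items() if remote_hashes.get(d) != h)


local = {"a1f": ["1.data"], "b2c": ["2.data", "3.data"], "c3d": ["4.data"]}
remote = {"a1f": ["1.data"], "b2c": ["2.data"]}
print(directories_to_sync(local, remote))
```

Here the remote copy is missing one file in `b2c` and the whole `c3d` directory, so only those two directories are reported for replication.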
The cluster has eventually consistent behavior where the newest data wins.
If a zone goes down, one of the nodes containing a replica notices and proactively copies data to a handoff location.
To describe how these pieces all come together, let's walk through a few scenarios and introduce the components.
A client uses the REST API to make an HTTP request to PUT an object into an existing Container. The cluster receives the request. First, the system must figure out where the data is going to go. To do this, the Account name, Container name and Object name are all used to determine the Partition where this object should live.
Then a lookup in the Ring figures out which storage nodes contain the Partitions in question.
The data is then sent to each storage node, where it is placed in the appropriate Partition. A quorum is required: at least two of the three writes must be successful before the client is notified that the upload was successful.
Next, the Container database is updated asynchronously to reflect that there is a new object in it.
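The quorum rule in the upload path can be sketched in a few lines. The `store` callable and the node names are hypothetical stand-ins for the real network writes; the sketch only shows the majority check.

```python
def write_with_quorum(data, nodes, store):
    """Send data to every node and report success if a majority of writes succeeded.

    `store(node, data)` is a hypothetical callable returning True on success.
    """
    quorum = len(nodes) // 2 + 1  # two out of three replicas
    successes = sum(1 for node in nodes if store(node, data))
    return successes >= quorum


nodes = ["node1", "node2", "node3"]
# One failed write still meets the quorum; two failed writes do not.
print(write_with_quorum(b"payload", nodes, lambda node, data: node != "node3"))
print(write_with_quorum(b"payload", nodes, lambda node, data: node == "node1"))
```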
A request comes in for an Account/Container/object. Using the same consistent hashing, the Partition name is generated. A lookup in the Ring reveals which storage nodes contain that Partition. A request is made to one of the storage nodes to fetch the object and if that fails, requests are made to the other nodes.
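The download path's fallback behavior can be sketched as a loop over the nodes returned by the Ring lookup. The `fetch` callable is a hypothetical stand-in for the request to a storage node, and using `OSError` to signal a failed node is an assumption of this sketch.

```python
def fetch_object(nodes, fetch):
    """Try each storage node in turn until one returns the object.

    `fetch(node)` is a hypothetical callable that returns the object's bytes
    or raises OSError if the node cannot serve it.
    """
    for node in nodes:
        try:
            return fetch(node)
        except OSError:
            continue  # try the next replica
    raise LookupError("object not available on any node")


def flaky_fetch(node):
    # Simulate the first node being down.
    if node == "node1":
        raise OSError("node1 unreachable")
    return b"object data from " + node.encode("utf-8")


print(fetch_object(["node1", "node2", "node3"], flaky_fetch))
```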
Large-scale deployments segment off an "Access Tier". This tier is the “Grand Central” of the Object Storage system. It fields incoming API requests from clients and moves data in and out of the system. This tier is composed of front-end load balancers, SSL terminators and authentication services, and it runs the (distributed) brain of the object storage system: the proxy server processes.
Having the access servers in their own tier enables read/write access to be scaled out independently of storage capacity. For example, if the cluster is on the public Internet, requires SSL termination and has high demand for data access, many access servers can be provisioned. However, if the cluster is on a private network and is being used primarily for archival purposes, fewer access servers are needed.
As this is an HTTP addressable storage service, a load balancer can be incorporated into the access tier.
Typically, this tier comprises a collection of 1U servers. These machines use a moderate amount of RAM and are network I/O intensive. As these systems field each incoming API request, it is wise to provision them with two high-throughput (10GbE) interfaces. One interface is used for 'front-end' incoming requests and the other for 'back-end' access to the object storage nodes to put and fetch data.
For most publicly facing deployments, as well as private deployments available across a wide-reaching corporate network, SSL will be used to encrypt traffic to the client. SSL adds significant processing load to establish sessions with clients, so more capacity in the access layer will need to be provisioned. SSL may not be required for private deployments on trusted networks.
The next component is the storage servers themselves. Generally, most configurations should have each of the five Zones with an equal amount of storage capacity. Storage nodes use a reasonable amount of memory and CPU. Metadata needs to be readily available to quickly return objects. The object stores run services not only to field incoming requests from the Access Tier, but also to run replicators, auditors, and reapers. Object stores can be provisioned with single gigabit or 10 gigabit network interfaces depending on expected workload and desired performance.
Currently 2TB or 3TB SATA disks deliver good price/performance value. Desktop-grade drives can be used where there are responsive remote hands in the datacenter, and enterprise-grade drives can be used where this is not the case.
Desired I/O performance for single-threaded requests should be kept in mind. This system does not use RAID, so each request for an object is handled by a single disk. Disk performance impacts single-threaded response rates.
To achieve higher apparent throughput, the object storage system is designed with concurrent uploads/downloads in mind. The network I/O capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired concurrent throughput needs for reads and writes.
Unlike most other storage systems, Swift can scale in two ways: As your Swift cluster grows in usage and the number of requests increases, performance doesn't degrade. To scale up, the system is designed to grow where needed: by adding proxy nodes as requests increase, and growing network capacity where choke points are detected. The servers that handle incoming API requests scale up just like any front-end tier for a web application. The system uses a shared-nothing approach and employs the same proven techniques that have been used to provide high availability by many web applications.
Since all content in Swift is available via HTTP, it also becomes very straightforward to either cache popular content locally or integrate with a CDN, such as Akamai. To add more storage capacity to a Swift cluster, just add more drives and nodes, which Swift will incorporate into its resources.
Swift is architected to withstand hardware failures without any downtime and provide operations teams the means to maintain, upgrade and enhance a cluster while in flight. To achieve this level of durability, objects are distributed in triplicate across the cluster. A write must be confirmed in two of the three locations to be considered successful. Auditing processes run to ensure the integrity of data. Replicators run to ensure that a sufficient number of copies are in the cluster. In the event that a device fails, data is replicated throughout the cluster to ensure that three copies remain.
Another feature is the ability to define failure zones. Failure zones allow a cluster to be deployed across physical boundaries, each of which could individually fail. For example, a cluster could be deployed across several nearby data centers, enabling it to survive multiple data center failures.
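Storing every object in triplicate has a direct effect on capacity planning: usable capacity is roughly one third of raw capacity. The back-of-the-envelope sketch below uses a hypothetical cluster size and ignores filesystem overhead and the free space a cluster needs as headroom.

```python
def usable_capacity_tb(nodes, drives_per_node, drive_tb, replicas=3):
    """Approximate usable capacity when every object is stored `replicas` times."""
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb / replicas


# Hypothetical cluster: 5 storage nodes, 12 drives each, 3TB per drive.
print(usable_capacity_tb(nodes=5, drives_per_node=12, drive_tb=3))
```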
Swift is licensed under the permissive Apache 2 open source license. As an open source project, Swift provides the following benefits to its users:
As the source code is publicly available, it can be reviewed by many more developers than is the case for proprietary software. This means that potential bugs also tend to be more visible and more rapidly corrected than for proprietary software. In the long term, "open" generally wins -- and Swift might be considered the Linux of storage.
Access to the Swift object storage system is through a REST API, which is similar to the Amazon S3 API and compatible with the Rackspace Cloud Files API. This means that (a) applications that are currently using S3 can use Swift without major refactoring of the application code and (b) applications that want to take advantage of both private and public cloud storage can do so, as the APIs are comparable.
Since Swift is comparable with public cloud services, developers and systems architects can also take advantage of the rich ecosystem of commercial and open-source tools available for these object storage systems.
If you look under the hood, Swift is built on proven components that work in large-scale production environments, such as rsync, MD5, SQLite, memcache, XFS and Python. Swift runs on off-the-shelf Linux distributions such as Ubuntu, unlike most other storage systems, which run on proprietary or highly customized operating systems.
From a hardware perspective, Swift is designed from the ground up to handle failures, so reliability at the individual component level is less critical. Thus, regular desktop drives can be used in a Swift cluster rather than more expensive "enterprise" drives. Hardware quality and configuration can be chosen to suit the tolerances of the application and the ability to replace failed equipment.
For organizations uncomfortable storing their data in a public cloud, Swift is an excellent alternative that allows you to retain control over network access, security, and compliance. Cost is also a major factor for bringing cloud storage in-house. Public cloud storage costs include per-GB pricing plus data transit charges, which can become very expensive. With the declining cost of hardware and drives, the total cost of ownership for a Swift cluster can be on par with S3 for a small cluster and much less than S3 for a large cluster.
The network latency to public storage service providers may also be unacceptable. A private deployment can provide lower-latency access to storage, as required by many applications. Also, applications may have large volumes of data in flight, which can't go over the public Internet.
For the above reasons, organizations can use Swift to build an in-house storage system that has similar durability/accessibility properties and is compatible with the suites of tools available for public cloud storage systems.
As an OpenStack project, Swift has the benefit of a rich community, which includes more than 100 participating companies and 1000+ developers. The following support options are available for Swift:
Swift and SwiftStack offer a real alternative to proprietary object storage systems and are much easier to use than traditional file-system based approaches. Swift is provided under the Apache 2 open source license, is highly scalable, extremely durable and runs on industry standard hardware. Swift also has a compelling set of compatible tools available from third parties and other open source projects. With the SwiftStack Platform, deployment, on-going management and monitoring can now be done with ease.
If you’d like to learn more about Swift and SwiftStack, contact us at firstname.lastname@example.org.