lakeFS Community
Yael Rivkind

November 25, 2020

While object storage is not a novel technology, it can still be overwhelming when you are getting started. Here’s a definitive guide to object-based storage with everything you need to know.

What is object storage?

At its core, object storage or object-based storage represents a data storage architecture that allows you to store large amounts of unstructured data in a highly scalable manner.

Nowadays, we need to store more than simple text in the tables or documents of relational or non-relational databases. Data types include email, images, video, web pages, audio files, datasets, sensor data, and other types of media content, most of which is unstructured. Some studies suggest that around 80% of the data in any organization is unstructured.

For large enterprises and organizations, storing and managing this unprecedented amount of data can be challenging and very costly.

The ease of use of object-based storage systems, and the benefits they bring, make this architecture the preferred method for data archiving, backup, and storing more or less any type of static content. A large volume of data does not have to mean poorly organized data: with object storage, you can keep data quality high by using a data lake model.

How does it work?

In object-based storage devices or systems, there are no folders, directories, files, or any hierarchies that you would normally see in a file-based system. Instead, these systems store all data in a flat data environment as objects. Each object contains the data itself, along with some descriptive information associated with that object, known as metadata, and a globally unique identifier.

In object-based storage, we therefore use the identifier and metadata, rather than a file path, to locate and access the data we saved.
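
To make the model concrete, here is a minimal in-memory sketch of a flat object store. It is an illustration only: the `FlatObjectStore` class and its method names are hypothetical and do not correspond to any provider's API. It shows the three parts of an object (data, metadata, unique ID) and lookup by ID or metadata instead of by path.

```javascript
const crypto = require("crypto");

// Hypothetical in-memory model of a flat object store (illustration only).
class FlatObjectStore {
  constructor() {
    // One flat map: no folders or nesting, only id -> object.
    this.objects = new Map();
  }

  // Stores the data with its metadata and returns a generated unique ID.
  put(data, metadata = {}) {
    const id = crypto.randomUUID();
    this.objects.set(id, { id, data, metadata });
    return id;
  }

  // Looks an object up by its ID instead of by a file path.
  get(id) {
    return this.objects.get(id);
  }

  // Finds objects whose metadata matches every given key/value pair.
  findByMetadata(query) {
    return [...this.objects.values()].filter((obj) =>
      Object.entries(query).every(([k, v]) => obj.metadata[k] === v)
    );
  }
}

const store = new FlatObjectStore();
const catId = store.put(Buffer.from("cat image bytes"), { type: "image", animal: "cat" });
store.put(Buffer.from("dog image bytes"), { type: "image", animal: "dog" });
```

Here `store.get(catId)` returns the cat object directly, and `store.findByMetadata({ type: "image" })` returns both objects without walking any directory tree.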

File vs Block vs Object storage

Now, you may be wondering how object storage compares with other types of data storage architecture. Let’s start at the beginning: the most notable architectures are file storage, block storage, and object storage. In most cases, if you want to save your data, one of these should fit your needs.

File Storage

File storage keeps data as files organized in folders, and is often exposed as network-attached storage (NAS). To access a file, the computer must know its full path. It is a good fit when you have a mix of structured and unstructured data or want to share data with many users at once. It can store just about anything, and you use it any time you access files from your PC.

Block Storage

Block storage saves data in raw blocks and, unlike file storage, is typically accessed through a Storage Area Network (SAN). This means that in a block storage architecture the servers that access the storage can be on different networks as well. A good use case for block storage is a workload that requires very low latency and consistent I/O performance, such as a database.

Object Storage

As discussed earlier, object storage is a type of architecture where each file is saved as an object that can be accessed through an HTTP request, usually GET. This type of storage is the best fit for scenarios where a lot of unstructured data needs to be managed. In object storage, each object receives a unique ID, which consumers use to retrieve it, and rich metadata that can hold anything from a privacy policy to whatever else you need.

Why use Object Storage? 

Object storage is the preferred way when it comes to static data management. Here is some more information about why this is a smart choice.

Object Storage Systems Optimize Data 

The most notable trait of object-based storage systems is that they don’t store data in a hierarchical structure. Without folders, an object is retrieved directly by its identifier, and every piece of data carries its own metadata.

This metadata is customizable, so it also allows for easier and more detailed analysis of the stored data. Thanks to the flat storage structure, you can also keep adding data and scaling up your storage system.

Finally, object-based storage is usually consumed as cloud-based storage. This means that the data is hosted on different devices than the one we use to access it, which protects it from hardware problems on that device.

Object Storage providers

There are many providers from which you can acquire object storage.

Nirvana

The first-ever object storage architecture was Nirvana, and it is worth a mention even though it is no longer in use. Nirvana was virtual object storage software, developed and released when object storage was still a fresh idea.

It was the result of research started in 1995 in response to a DARPA-sponsored project for a Massive Data Analysis System, which aimed to let organizations manage unstructured data hosted on different devices in different regions of the world.

Although it’s not operating anymore, this is how object storage architecture first started.

Amazon S3 Storage

The most notable provider for object storage is Amazon S3, also known as Amazon Simple Storage Service. Although the technical design is not publicly provided, Amazon S3 aims to be a scalable, easy-to-use, cheap storage service from Amazon with high availability and low latency. 

As in any object-based storage, the basic units of S3 are objects, which are organized in buckets. These buckets also provide additional features such as versioning and policy rules to better manage access to the files that you are uploading.

These objects can be managed through the AWS SDK, through the REST API using HTTP requests, or via the UI that Amazon provides. S3 also offers several storage classes, which differ in how often the objects are expected to be accessed. For example, S3-IA can be used for data that is needed less often and S3 Glacier for archival storage.

However, the perks that Amazon brings come with a higher price and more complexity. The service is billed pay-as-you-go, at around $0.023 per GB per month. This price may vary slightly based on the total amount of space you use and on the S3 storage class, as there is more than one: S3 Standard, S3 Infrequent Access, and S3 Glacier.

For individuals or small businesses, a cheaper and easier-to-use solution can be more beneficial.

Things to consider when selecting a provider

Apart from the starting cost, there are many other factors you should consider when selecting a cloud solution provider for object storage:

  • Cost of API calls – can be free or charged
  • Bandwidth – if you exceed the bandwidth offered in the price plan, you will be charged for each extra GB that you transfer; usually $0.01 per additional GB transferred
  • Egress charges – traffic that flows from the private network out to the public internet
  • Edge service (CDN) – can be free or charged

These other aspects are usually known as hidden costs, making it imperative to know everything offered in the package. This can help you accurately estimate the monthly costs of your cloud solution.
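
As a rough illustration of such an estimate, the sketch below adds up storage, extra bandwidth, and API request charges. The function name and parameters are hypothetical, the default storage and transfer rates simply reuse the figures quoted in this article, and the API call rate is an assumption; always substitute your provider's actual price list.

```javascript
// Rough monthly cost estimate. All rates are illustrative assumptions,
// not any provider's actual price list.
function estimateMonthlyCost({
  storedGb,
  transferredGb,
  includedTransferGb = 0,
  apiCalls = 0,
  pricePerGbStored = 0.023, // storage figure quoted in this article
  pricePerExtraGbTransferred = 0.01, // bandwidth figure quoted in this article
  pricePerThousandApiCalls = 0, // assumption: API calls are free
}) {
  const storage = storedGb * pricePerGbStored;
  // Only traffic beyond the plan's included bandwidth is billed.
  const extraTransferGb = Math.max(0, transferredGb - includedTransferGb);
  const transfer = extraTransferGb * pricePerExtraGbTransferred;
  const requests = (apiCalls / 1000) * pricePerThousandApiCalls;
  return { storage, transfer, requests, total: storage + transfer + requests };
}

// Example: 1000 GB stored, 1500 GB transferred, 1000 GB of transfer included.
const estimate = estimateMonthlyCost({
  storedGb: 1000,
  transferredGb: 1500,
  includedTransferGb: 1000,
});
console.log(estimate.total.toFixed(2));
```

Keeping each hidden cost as a separate field makes it easy to see which one dominates the bill as usage grows.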

How to access object storage? 

The objects in an object-based storage system are accessed via Application Programming Interfaces (APIs).

The simplest way to manage your space is the graphical interface (dashboard) that the provider offers directly on its website. The dashboard is already integrated with the object storage through an API. It supports only a small number of actions, but all the essentials are present: you can upload, download, edit, and delete your data. If you want to integrate object-based storage into your own application, however, you must write the API integration yourself.

For this, there are two APIs available.

Provider API 

The native API for object storage is an HTTP-based REST API. 

These APIs use an object’s identifier and metadata to locate the desired object. RESTful APIs use HTTP methods such as PUT or POST to upload an object, GET to retrieve an object, and DELETE to remove it.
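
The mapping between operations and HTTP methods can be sketched as follows. This is a simplified illustration: `buildObjectRequest` is a hypothetical helper, the endpoint and bucket are placeholders, and the URL layout follows the `https://${bucket}.${endpoint}/${key}` pattern used later in this article. Real providers also require authentication headers, which are omitted here.

```javascript
// Maps basic object operations to the HTTP method and URL that a REST API
// would typically use. Endpoint, bucket, and key are placeholders.
function buildObjectRequest(operation, { endpoint, bucket, key }) {
  const methods = { upload: "PUT", download: "GET", remove: "DELETE" };
  const method = methods[operation];
  if (!method) throw new Error(`Unsupported operation: ${operation}`);

  // Encode each path segment but keep "/" so key prefixes stay readable.
  const encodedKey = key.split("/").map(encodeURIComponent).join("/");
  return { method, url: `https://${bucket}.${endpoint}/${encodedKey}` };
}

const target = {
  endpoint: "storage.example.com",
  bucket: "my-bucket",
  key: "my-pets/cat 1.jpg",
};
const request = buildObjectRequest("download", target);
```

The same key yields the same URL for all three operations; only the HTTP method changes.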

Each provider comes with its own REST API you can use to integrate that specific object storage system with your applications.

Sometimes, using this API can be challenging because of its complexity and many security layers.

In these cases, special headers must be created and signed using different cryptographic algorithms for the content you want to upload.
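
As one concrete case, AWS Signature Version 4 derives a signing key through a chain of HMAC-SHA256 operations and then signs a "string to sign" with it. The sketch below shows only that step, using Node's built-in `crypto` module. Building the canonical request and the string to sign is omitted, and the secret key, date, region, and string to sign are placeholders. In practice an SDK performs all of this for you.

```javascript
const crypto = require("crypto");

// HMAC-SHA256 that returns raw bytes, so results can be chained as keys.
function hmac(key, data) {
  return crypto.createHmac("sha256", key).update(data, "utf8").digest();
}

// Derives the Signature Version 4 signing key from the secret access key.
function deriveSigningKey(secretAccessKey, dateStamp, region, service) {
  const kDate = hmac(`AWS4${secretAccessKey}`, dateStamp); // date as YYYYMMDD
  const kRegion = hmac(kDate, region);
  const kService = hmac(kRegion, service);
  return hmac(kService, "aws4_request");
}

// Signs an already-built string to sign and returns the hex signature.
function sign(stringToSign, signingKey) {
  return crypto
    .createHmac("sha256", signingKey)
    .update(stringToSign, "utf8")
    .digest("hex");
}

// Placeholder inputs for illustration only.
const signingKey = deriveSigningKey("my-secret-key", "20201125", "us-east-1", "s3");
const signature = sign("placeholder-string-to-sign", signingKey);
```

The resulting signature is sent in the request's authorization header, and the server repeats the same derivation to verify it.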

Amazon S3 API

Another solution is to use the S3 API from Amazon, which has become the de facto standard API for object storage. Using it gives you a uniform way to communicate with the storage through the AWS SDK. The SDK is available for most programming languages and is free to use under the Apache-2.0 License. Most well-known vendors offer object storage compatible with this API; check the product description to verify this.

Using this API provides several benefits. One of them is ease of integration, as the code is already written and encapsulated in methods and properties that you can use. And because S3 is the standard, it is much easier to find information and help in the open-source community if you run into problems: you are most likely not the first to encounter them.

Furthermore, if you ever want to move your data from one object store to another, maybe even to another provider, you don’t have to worry about refactoring the code. If the new object store is S3-compatible, you will most likely not need to change anything; if something does need to change, the updates will be minor.

Integrate object storage in your application 

Let’s see how easy it is to use object-based storage.

Previously we mentioned that the Amazon S3 API is the de facto standard for object storage, so that’s what we are going to use here as well. For the purposes of this guide, we will use Node.js and show only the parts of interest, without explaining the basics of this ecosystem.

As a prerequisite, we need to install the AWS SDK, which is available on npm as aws-sdk. It also provides its own type definitions, so you don’t need @types/aws-sdk installed if you want to use it in a TypeScript-based project.

To access the object-based storage system, you need to provide the following information about it:

  • secret-access-key and access-key-id – a secret key and its public identifier, which you can generate using different tools or sometimes directly on the provider dashboard
  • endpoint – the web address of the space

You will have access to these pieces of information once you acquire an object storage system directly on the provider’s website.

Once you have this information, you can pass it to the AWS SDK as follows:

const AWS = require("aws-sdk");

const s3 = new AWS.S3({
  endpoint: "provider-space-endpoint",
  secretAccessKey: "my-secret-key",
  accessKeyId: "my-access-key",
});

(Remember to replace the dummy text with your own values).

To upload a file to the object-based storage system, we need the following data:

  • bucket – the name of the bucket where we want to upload (objects are saved in a bucket; each space can have one or more buckets depending on the infrastructure of your provider)
  • key – the file key, usually this is the name of the file
  • body – the file content in the form of a Buffer|Uint8Array|Blob|string|Readable

  // Bucket, Key, and Body hold the values described above.
  // The await below must run inside an async function.
  const params = {
    Bucket,
    Key,
    Body,
  };

  const res = await s3.upload(params).promise();

Tip #1 – Creating a pseudo folder path

We said there are no hierarchies, folders or directories in an object storage system. However, you can use the key of an object to compute a file-like path using prefixes.

For example, if you have two files named cat.jpg and dog.jpg, you can upload them with the key values my-pets/cat.jpg and my-pets/dog.jpg. This way, if you ever want to get all the files from the pseudo-folder my-pets, you can query the object storage using the prefix my-pets/, and it will return all the objects whose keys start with this prefix.
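
A prefix query of this kind could look like the sketch below. It assumes `s3` is an `AWS.S3` client from the aws-sdk (v2) package, configured as shown earlier, and uses its `listObjectsV2` call; `listKeysByPrefix` is a hypothetical helper name. S3-compatible providers may differ in which listing options they support.

```javascript
// Lists every key under a prefix. `s3` is expected to be an AWS.S3 client
// from the aws-sdk (v2) package; Bucket and Prefix are plain strings.
async function listKeysByPrefix(s3, Bucket, Prefix) {
  const keys = [];
  let token;

  do {
    const params = { Bucket, Prefix };
    if (token) params.ContinuationToken = token;

    // Each call returns at most one page of results.
    const page = await s3.listObjectsV2(params).promise();

    for (const obj of page.Contents || []) keys.push(obj.Key);
    token = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (token);

  return keys;
}

// Usage (inside an async function):
//   const keys = await listKeysByPrefix(s3, "my-bucket", "my-pets/");
```

Including the trailing slash in the prefix avoids also matching keys such as my-pets-archive/cat.jpg.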

Here is a full example of code that uploads all the files from Desktop and creates a pseudo-folder path using the key property.

const AWS = require("aws-sdk");
const fs = require("fs");
const path = require("path");
const mime = require("mime-types");

const s3 = new AWS.S3({
  endpoint: "",
  secretAccessKey: "",
  accessKeyId: "",
});

const Bucket = "my-bucket";

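async function uploadFile(Key, Body, ContentType) {
  const params = {
    Bucket,
    Key,
    ACL: "private", // the object is not publicly readable
    Body,
    ContentType,
  };

  return s3.upload(params).promise();
}

async function uploadFolder(folderName, folderPath) {
  // withFileTypes lets us skip subdirectories, which readFileSync cannot read
  const entries = fs.readdirSync(folderPath, { withFileTypes: true });

  // for...of (rather than forEach) so that each upload is actually awaited
  for (const entry of entries) {
    if (!entry.isFile()) continue;

    const fullFilePath = path.join(folderPath, entry.name);

    const buff = fs.readFileSync(fullFilePath);

    await uploadFile(
      `${folderName}/${entry.name}`,
      buff,
      // mime.lookup returns false for unknown extensions
      mime.lookup(fullFilePath) || "application/octet-stream"
    );
  }
}

uploadFolder("Desktop", "C:/Users/user/Desktop").catch(console.error);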

This code will read (as a buffer) all the files from the Desktop and upload them in the bucket my-bucket as private files. This way, they will not be publicly available on the internet, but rather can be accessed only using specific API keys.

Tip #2 – Multipart Upload

This code was only a demo: the Desktop folder may also hold large files that exceed the maximum allowed size, and reading a file as a buffer loads its entire content into RAM.

As a best practice, if your file size exceeds 5MB, you can use a multipart upload. This way, you split the file into chunks of at least 5MB and upload them individually.
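
The splitting step can be sketched as a small planning function that computes the byte range of each part. `planParts` is a hypothetical helper, not part of any SDK, and it only plans the ranges; it does not upload anything. The 5 MiB minimum follows the S3 rule that every part except the last must be at least that large.

```javascript
const MIN_PART_SIZE = 5 * 1024 * 1024; // 5 MiB

// Splits a file of `fileSize` bytes into consecutive byte ranges.
function planParts(fileSize, partSize = MIN_PART_SIZE) {
  if (partSize < MIN_PART_SIZE) {
    throw new RangeError("Part size must be at least 5 MiB");
  }

  const parts = [];
  for (let start = 0, n = 1; start < fileSize; start += partSize, n++) {
    // `end` is exclusive; the last part may be smaller than partSize.
    const end = Math.min(start + partSize, fileSize);
    parts.push({ partNumber: n, start, end });
  }
  return parts;
}

const parts = planParts(12 * 1024 * 1024); // plan for a 12 MiB file
```

Each range can then be read with `fs.createReadStream(path, { start, end: end - 1 })`, whose `end` option is inclusive, so only one part is held in memory at a time. Note that the `upload()` method of the aws-sdk used above is a managed uploader that can split large bodies into parts on its own, so manual planning like this matters mainly when you drive the low-level multipart calls yourself.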

Once you upload a file to the object-based storage system and set it as a public object, then you can access it using the bucket-name, the endpoint, and the file key.

https://${bucket}.${endpoint}/${key}

  getFileLocationOnSpace(key: string) {
    const bucket: string = this.configService.get('SPACE_BUCKET');
    const endpoint: string = this.configService.get('SPACE_ENDPOINT');

    return `https://${bucket}.${endpoint}/${key}`;
  }

Tip #3 – Finding the object URL

You can see the URL by opening the object directly from the provider dashboard.

Use Cases of Cloud Object Storage 

The object-based storage architecture is perfect for the following cases:

  • Data recovery & backups – with object storage, you can securely and cost-efficiently store your data 
  • Analytics – when you have a large amount of data, for example data used to train AI models or to run analytics, object storage is a great way to host it as a data lake
  • Static content – you can use object storage to serve all the static content for your web app or all the content that’s created by your application

Don’t Forget Data Versioning

With the help of the data’s unique ID and metadata, we can achieve a form of data versioning. With data versioning, we can keep multiple variants of an object in the same bucket. This is very helpful if we want to preserve, retrieve, and restore every version of every object stored in our bucket.

Data versioning can be achieved through the S3 API or the REST API provided by your selected cloud solution. For this, the bucket in which you store the files must have versioning capabilities enabled.
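
With the S3 API, enabling and inspecting versioning could look like the sketch below. It assumes `s3` is an `AWS.S3` client from the aws-sdk (v2) package and uses its `putBucketVersioning` and `listObjectVersions` calls; the two helper function names are hypothetical. Not every S3-compatible provider supports versioning, so check your provider's documentation first.

```javascript
// Turns on versioning for a bucket. `s3` is an AWS.S3 client (aws-sdk v2).
async function enableVersioning(s3, Bucket) {
  await s3
    .putBucketVersioning({
      Bucket,
      VersioningConfiguration: { Status: "Enabled" },
    })
    .promise();
}

// Returns the stored versions of one key. For brevity this reads only the
// first page of results; a long history would need pagination.
async function listVersions(s3, Bucket, Key) {
  const res = await s3.listObjectVersions({ Bucket, Prefix: Key }).promise();
  return (res.Versions || [])
    .filter((v) => v.Key === Key) // the prefix may match other keys too
    .map((v) => ({ versionId: v.VersionId, isLatest: v.IsLatest }));
}

// Usage (inside an async function):
//   await enableVersioning(s3, "my-bucket");
//   const versions = await listVersions(s3, "my-bucket", "my-pets/cat.jpg");
```

A specific version can then be fetched by passing its `VersionId` alongside `Bucket` and `Key` to `getObject`.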

Another solution is to use an external tool that will do this for you. For example, you can use lakeFS, which provides a Git-like branching and committing model that scales to large amounts of data by utilizing S3 or GCS for storage.

Using such a tool will make your data more durable and easier to manage. Suppose an operation that was performed on the object-based storage triggered a change in the quality of the data that you expose; then you can revert instantly to a former, consistent, and correct snapshot of your data lake.

To find out more about lakeFS, head over to the lakeFS GitHub repository and docs.

Conclusion

Object storage comes to the rescue of modern infrastructure, where more and more data is created, stored, and shared across the internet.

Used for the right workloads, this type of architecture can improve the overall performance and stability of your infrastructure, and your confidence in it.
