Editor’s note: This article is by cloud analytics infrastructure expert Gil Vernik of IBM Research - Haifa.
Today's massive growth in data sets means that storage is increasingly becoming
a critical bottleneck for system workloads. My storage team in Haifa, Israel,
wants to analyze and understand these massive volumes of data, and we need to
store them somewhere reliable. Disk storage is an option, but it's too slow for
fast Big Data processing. In-memory computing, which keeps the data in a
server's RAM for fast access and processing, offers a good solution for Big
Data workloads, but it's limited and expensive.
Enter Tachyon, a memory-centric distributed storage system that offers
processing at memory speed along with reliable storage. It runs across clusters
of servers, so there's plenty of room for data, and its lineage-based recovery
eliminates the need for replication to ensure fault tolerance. Now we've
connected Tachyon to Swift so it can work effortlessly with Swift and
SoftLayer. The result? Tachyon is even more flexible and efficient.
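At a high level, the connection plugs Swift in as Tachyon's "under file system," the durable layer beneath the in-memory tier. A sketch of what that configuration looks like follows; the variable name reflects older Tachyon releases, and the container and service names are placeholders, so check your version's documentation for the exact settings:

```shell
# tachyon-env.sh (sketch only; exact settings vary by Tachyon version)
# Use a Swift container -- for example, one hosted on SoftLayer -- as the
# durable under file system instead of HDFS. "my-container" and "my-service"
# are hypothetical placeholders.
export TACHYON_UNDERFS_ADDRESS=swift://my-container.my-service/
```

With the under file system pointed at Swift, data that Tachyon checkpoints out of memory lands in the object store, where it gets object storage's low cost and durability.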
Building efficiency into Big Data analytics
Let’s take something like Facebook’s data: tons of it needs to be stored and
analyzed, including logs, activities, connections, media, locations, messages,
and so forth. A good solution is to store it all as objects or files in an
object store like Swift. Why as objects? Object storage offers two important
things: low-cost storage and reliability; even if my computer fails, I know my
data is safe.
Several applications can analyze
this data, including Apache Spark and Apache Flink. Often, while one set of
analytics is running, say, to decide which ads to display on your news page,
another user might analyze the same data set to find out which geographies you
have visited most often. In short, we can have different instances of analytics
workloads all reading and writing the same data. Tachyon uses memory
aggressively and can serve the shared data and results to all of them, so the
same work doesn’t have to be done separately, multiple times.
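The payoff of that sharing can be sketched in plain Python. This is a toy model of memory-centric sharing, not Tachyon's actual API: the data set is materialized in memory once, and two independent "jobs" read the same cached copy instead of each re-loading it from slow storage.

```python
# Toy model of memory-centric data sharing (not Tachyon's actual API):
# an expensive load happens once, and multiple analytics jobs read the
# same in-memory copy instead of each re-reading it from the object store.

load_count = 0
_cache = {}

def load_dataset(name):
    """Materialize the data set once; later readers hit the in-memory copy."""
    global load_count
    if name not in _cache:
        load_count += 1  # simulates one slow read from the object store
        _cache[name] = [("alice", "ad_click"), ("bob", "login"), ("alice", "login")]
    return _cache[name]

def ad_analysis(name):
    """Job 1: count ad clicks in the shared data set."""
    return sum(1 for _, event in load_dataset(name) if event == "ad_click")

def activity_analysis(name):
    """Job 2: count events per user over the *same* shared data set."""
    counts = {}
    for user, _ in load_dataset(name):
        counts[user] = counts.get(user, 0) + 1
    return counts

print(ad_analysis("user_logs"))        # 1
print(activity_analysis("user_logs"))  # {'alice': 2, 'bob': 1}
print(load_count)                      # 1 -- the data was loaded only once
```

Both jobs see the same data, but the expensive load from storage happened exactly once; in Tachyon the shared copy additionally lives off-heap, so it survives individual job restarts.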
The latest evaluations show that
Tachyon outperforms in-memory HDFS by 110x for writes. It also improves the
end-to-end latency of a realistic workflow by 4x.
Faster, cheaper, and reliable: Tachyon solves the storage bottleneck
Tachyon, a project out of UC Berkeley’s AMPLab, is intended to help
organizations quickly store and access all that information. (The term tachyon
refers to a hypothetical particle that moves faster than light.) And it has
been gaining momentum. We certainly recognize its tremendous
potential for improving the efficiency and fault tolerance of computation
frameworks, such as Spark and Flink. With the help of Tachyon Nexus founder Haoyuan Li and the Tachyon community, we were
able to turn this potential into reality.
Many frameworks, like Spark, take advantage of memory. But when data is shared between different frameworks or jobs, it has to be written out to different systems, which takes time, and keeping those writes synchronized is even more difficult. Tachyon achieves memory throughput without unnecessary replication while still providing reliability.
If a computer fails, the system re-computes
the lost data using lineage, and in this way provides reliability through fault
tolerance. Data lineage is generally defined as a kind of data life cycle that
includes the data's origins and its transformations. Because it’s a distributed
system, Tachyon works as a cluster, using the memory of a whole bunch of
computers, so if one computer fails, there’s no problem. Even if the entire
cluster fails, the data isn’t gone, because Tachyon periodically checkpoints
everything to disk.
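Lineage-based recovery can be illustrated with a small Python sketch. This is a toy, and assumes nothing about Tachyon's internals: instead of replicating an output, the system remembers which input and which transformation produced it, and re-runs that transformation if the output is lost.

```python
# Toy lineage-based recovery (illustration only, not Tachyon internals):
# each derived data set records its parent and the transformation that
# produced it, so lost data can be recomputed instead of replicated.

class Dataset:
    def __init__(self, data=None, parent=None, transform=None):
        self.data = data            # may be dropped (e.g., a node failure)
        self.parent = parent        # lineage: where the data came from
        self.transform = transform  # lineage: how it was derived

    def map(self, fn):
        """Derive a new data set, recording its lineage."""
        return Dataset(data=[fn(x) for x in self.data], parent=self, transform=fn)

    def get(self):
        """Return the data, walking the lineage to recompute it if lost."""
        if self.data is None:
            self.data = [self.transform(x) for x in self.parent.get()]
        return self.data

source = Dataset(data=[1, 2, 3])
derived = source.map(lambda x: x * 10)

derived.data = None    # simulate losing the in-memory copy
print(derived.get())   # [10, 20, 30] -- recomputed via lineage, no replica needed
```

The trade-off is recomputation time on failure in exchange for never paying the memory and network cost of keeping replicas during normal operation.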
Tachyon is open source and already deployed
in production at multiple companies. In addition, the project has more than 100
contributors from more than 30 institutions, including Yahoo, Tachyon Nexus,
Red Hat, Baidu, Intel, and of course, IBM. The project is the storage layer of
the Berkeley Data Analytics Stack (BDAS) and is also part of the Fedora
distribution.
We’ll continue investing effort in
Tachyon so that more organizations can take advantage of the performance boost
it offers. This collaboration goes a long way toward preventing repetitive
work, improving memory utilization, and helping Big Data analytics frameworks
like Spark reach new levels of performance.
Labels: analytics, Berkeley, cloud, Gil Vernik, Haoyuan Li, IBM Research - Haifa, in-memory processing, SoftLayer, Spark, Swift, Tachyon