- See You, SQL – Hello Hadoop
- Managing distributed Solr Servers
Our team has developed a system for storing and processing huge amounts of log data using Hadoop. The challenge was to handle gigabytes of new log messages every day and to query the 30+ terabyte archive with instantaneous results. In this first part of our blog series, we explain our motivation for using Hadoop and contrast this solution with the traditional relational database approach.
Let’s consider the scenario of one of our customers. This customer uses around 40 applications, hosted on more than 200 servers in different data centers, to manage their business processes. Typically, a single business process involves at least 4 different applications. Each application creates log messages which are stored as text files in the local filesystem. This is fine as long as only a few applications or servers are involved. But in our case, it is not feasible to retrieve the log files directly from around 200 different servers.
The Pros of using a Relational Database
For several years, our customer stored the log data in a central repository using a relational database. This approach has a number of advantages:
- Data Retrieval: A single query submitted to the database can retrieve log messages from multiple servers or applications, so that user interactions may be tracked throughout the whole system.
- Process Monitoring: It is possible to monitor business processes which involve multiple applications.
- Gathering Statistics: Statistics involving multiple applications can be generated easily.
- Access Control: Access to potentially sensitive log data can be controlled on a per user or per group basis. It is possible for employees without access to the application servers to query the log data.
- Data Security: Sensitive data can be deleted or redacted if necessary.
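To illustrate the data retrieval point, here is a minimal sketch of a central log table queried across servers and applications. The table layout, column names and sample rows are assumptions made up for this example (using an in-memory SQLite database), not the customer's actual schema.

```python
import sqlite3

# Hypothetical central log table; all names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log (ts TEXT, server TEXT, app TEXT, level TEXT, message TEXT)"
)
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?, ?, ?)",
    [
        ("2010-03-01 10:00:01", "srv-017", "shop", "INFO", "user=4711 placed order 98"),
        ("2010-03-01 10:00:02", "srv-102", "billing", "INFO", "invoice created for user=4711"),
        ("2010-03-01 10:00:03", "srv-055", "shipping", "WARN", "address check failed for user=4711"),
        ("2010-03-01 10:00:04", "srv-017", "shop", "INFO", "user=1234 logged in"),
    ],
)


def trace_user(connection, user_id):
    """Return (ts, server, app, message) rows mentioning the user, oldest first."""
    cursor = connection.execute(
        "SELECT ts, server, app, message FROM log "
        "WHERE message LIKE ? ORDER BY ts",
        (f"%user={user_id}%",),
    )
    return cursor.fetchall()


# One query follows a single user across three applications on three servers.
trail = trace_user(conn, 4711)
for ts, server, app, message in trail:
    print(ts, server, app, message)
```

Note that the query has to search inside the free-text message column, which already hints at the problems discussed further down.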
Storing the log data in a database worked fine for small to medium amounts of log data. As long as the applications created a couple of hundred megabytes of log data a day, the database was perfectly suited to the task.
However, the customer ran into problems when the amount of log data grew to multiple gigabytes per day. The situation became even worse when it turned out to be necessary to query the data several months into the past, which requires long-term storage. In our example, we see a growth rate of around 10 GB per day, and the database has a total size of around 3 TB. Searching the database is only possible for very limited time ranges. If a user tries to retrieve log data for a time range of more than 20-30 minutes, the query fails after running for an hour.
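As a back-of-the-envelope check, the two figures above can be related to each other. The sketch below assumes constant growth and decimal units (1 TB = 1000 GB); both are simplifications, and the function names are made up for this example.

```python
# Figures taken from the text; constant growth and decimal units are assumptions.
GROWTH_GB_PER_DAY = 10
DATABASE_SIZE_TB = 3


def days_of_history(size_tb, growth_gb_per_day):
    """Days of log data that fit into size_tb at a constant daily growth."""
    return size_tb * 1000 // growth_gb_per_day


def size_after_days_tb(days, growth_gb_per_day):
    """Size in TB after the given number of days at a constant daily growth."""
    return days * growth_gb_per_day / 1000


print(days_of_history(DATABASE_SIZE_TB, GROWTH_GB_PER_DAY))  # 300
print(size_after_days_tb(365, GROWTH_GB_PER_DAY))            # 3.65
```

Under these assumptions, 3 TB corresponds to roughly ten months of log data, and every additional year adds well over 3 TB.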
Impedance Mismatch #1: Log data has no fixed structure
One of the problems of storing log messages in relational databases is that these messages don’t have a fixed structure. Admittedly, they usually contain some fixed fields, such as a timestamp, a logging level, or the name of the logging component or application. But the important part of the log message usually consists of unformatted plain text.
There are different ways to store these messages in a relational database. You might put the whole message into a CLOB field, or you might try to fit the message into a pre-defined schema. The CLOB approach leads to problems when retrieving the data, since you need to build a full-text search index to be able to find a message. The pre-defined schema can cause problems when you add a new application which creates log messages in a different format. Messages might also be truncated when they do not fit into the schema.
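The pre-defined schema problem can be sketched in a few lines. The line format, the regular expression and the column width below are assumptions chosen for the demonstration; real applications will log in other formats.

```python
import re

# Hypothetical line format: "<date> <time> <LEVEL> <component> <free text>".
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<component>\S+) "
    r"(?P<message>.*)$"
)

MESSAGE_COLUMN_WIDTH = 40  # assumed VARCHAR(40) column, kept small for the demo


def to_row(line):
    """Split a log line into fixed fields and fit the message into the column."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # a line in another format does not fit the schema at all
    row = match.groupdict()
    row["truncated"] = len(row["message"]) > MESSAGE_COLUMN_WIDTH
    row["message"] = row["message"][:MESSAGE_COLUMN_WIDTH]
    return row


ok = to_row(
    "2010-03-01 10:00:03 WARN shipping "
    "address check failed for user=4711 in warehouse 12"
)
other = to_row("[WARN] 01/03/2010 shipping: address check failed")
print(ok)
print(other)
```

The first line parses, but its message is cut off at the column width. The second line comes from an application with a different format and cannot be stored without changing the schema or the parser.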
Relational databases are very powerful for handling structured data. But log messages are inherently unstructured. Log entries can be handled as separate entities which are not related to each other. There is usually no need for the typical relational database capabilities, namely transactions, constraints, and relationships.
Impedance Mismatch #2: Log data is written only once
Log data is mostly written to the database once. Modifications of stored data might happen but are very rare. Writes therefore have to be very efficient, which is difficult to achieve in an environment where hundreds or thousands of log messages are stored every second.
It quickly becomes necessary to partition the database to distribute the load across multiple hard disks or even servers. A time-based partitioning scheme does not work well for log data, because most of the data is inserted into the partition for the current date. To distribute the load, another scheme is needed, for example partitioning by application. But here we have to deal with the problem that different applications produce different amounts of log data. Finding a balanced partitioning scheme is a challenge, and a scheme that works today may have to change over time.
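The write skew of both partitioning schemes can be made concrete with a small sketch. The per-application message counts are made-up numbers for illustration; only the principle is taken from the text.

```python
from collections import Counter
from datetime import date

# Made-up daily message counts per application; real volumes will differ.
DAILY_MESSAGES = {"shop": 700, "billing": 200, "shipping": 90, "reporting": 10}
TODAY = date(2010, 3, 1)


def writes_by_date_partition(daily_messages, today):
    """Time-based scheme: every new message lands in today's partition."""
    load = Counter()
    for count in daily_messages.values():
        load[today.isoformat()] += count
    return load


def writes_by_app_partition(daily_messages):
    """Application-based scheme: one partition per application."""
    return Counter(daily_messages)


def busiest_share(load):
    """Fraction of all writes that hits the most loaded partition."""
    return max(load.values()) / sum(load.values())


by_date = writes_by_date_partition(DAILY_MESSAGES, TODAY)
by_app = writes_by_app_partition(DAILY_MESSAGES)
print(busiest_share(by_date))  # 1.0
print(busiest_share(by_app))   # 0.7
```

With the time-based scheme a single partition takes all writes. Partitioning by application spreads them out, but in this example one chatty application still causes 70% of the load.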
Looking for specialized Log Processing Solutions
To scale better with the growing amount of log data, we looked for an alternative to the current database solution. Switching to something like Oracle Parallel Server was not an option because of the partitioning issues discussed in the previous section and because of the very high licensing costs.
The first very capable solution we found was Splunk. Splunk is a very nice product for collecting and analyzing data. It is scalable and can handle many terabytes of data. Licensing costs are tied to the amount of data that needs to be stored per day. Overall, Splunk is worth a detailed evaluation when you need a product to handle large amounts of user data. In the end, we did not use Splunk because our customer requested a highly customized solution without recurring licensing costs.
We were aware of Google’s papers on handling large amounts of data with their Google File System, BigTable and MapReduce concepts. In Hadoop we found an open-source framework released under the Apache License which implements these concepts and enables applications to process petabytes of data on thousands of servers.
The core of Hadoop consists of the Hadoop Distributed File System (HDFS), an implementation of the Google File System, and an implementation of MapReduce. The BigTable concept is implemented in a subproject of Hadoop called HBase. An initial prototype evaluation showed that the Hadoop framework is a very elegant solution for our scalability problems. Tests with a small cluster of 5 servers with a total cost of around 7,000 Euro showed 5 times the performance of the present solution at only around 5% of its hardware costs, which is roughly a hundredfold improvement in performance per Euro of hardware.
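For readers new to MapReduce, the idea can be sketched in plain Python. This is not the Hadoop API and it runs in a single process; the function names and the line format ("app level text") are assumptions for this example. In a real cluster, the framework distributes the map and reduce steps across many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (application, 1) for one log line of the assumed form 'app level text'."""
    app = line.split(" ", 1)[0]
    yield app, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def count_messages_per_app(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


LINES = [
    "shop INFO order placed",
    "billing INFO invoice created",
    "shop WARN payment retry",
]
print(count_messages_per_app(LINES))  # {'shop': 2, 'billing': 1}
```

Because each log line is mapped independently and each key is reduced independently, the work can be split across servers without coordination between them. This fits log entries well, since they are separate entities that are not related to each other.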
In the next part of this article series, we will take a deeper look at the Hadoop technologies and introduce our solution.