Introduction to Bigdata | Surendra Mohan

Before we start playing around with bigdata for Solr, let us understand bigdata briefly.

Bigdata is the term for a collection of data sets so large and complex that they are difficult to process using database management tools or traditional data processing applications. The challenges may lie in one or more of the following: capturing, curating, storing, analyzing, visualizing, sharing, transferring, and searching the data. Data sets tend to grow this large because more information can be derived from analyzing a single large set of related data than from separate smaller sets that hold the same total amount of data. In other words, a single large data set contributes more to bigdata than several smaller data sets with the same combined volume. Bigdata can be measured in exabytes, where one exabyte is a million terabytes. So, you can imagine the amount of data that your application (in our case, Apache Solr) is expected to handle without any impact on its performance.
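
To put the exabyte figure in perspective, here is a minimal sketch of the unit arithmetic. It assumes decimal (SI) units, where a terabyte is 10^12 bytes and an exabyte is 10^18 bytes; the function name is made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; binary
# units such as tebibytes and exbibytes would give different figures).
TERABYTE = 10**12
EXABYTE = 10**18


def exabytes_to_terabytes(exabytes: float) -> float:
    """Convert a size in exabytes to terabytes using decimal units."""
    return exabytes * EXABYTE / TERABYTE


# One exabyte works out to one million terabytes.
print(exabytes_to_terabytes(1))
```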

A number of factors characterize bigdata. A few of the important ones are:

Volume – Data volume can grow for various reasons, such as transaction-based data stored over the years, unstructured data streaming in from social media, and the growing number of sensors and other hardware devices, along with the data collected across machines.
Velocity – Data streams in at a pace that makes it tough to process, and it needs to be dealt with in a timely manner. RFID tags, sensors, and smart metering systems are among the sources that make it necessary to handle streaming data in near real time. Dealing with such data requires prompt, time-bound action, which is itself a challenge for most organizations.
Variety – This refers to the data format. Data can come in any format, such as structured numeric data residing in traditional databases, output from line-of-business applications, unstructured text documents, financial transactions, emails, audio, video, and stock-related information. Managing such a range of formats effectively in the corporate world can be compared to a team of soldiers struggling in battle without weapons adequate to the task.
Complexity – Today, data comes from many different sources. To handle it, the data from all these sources needs to be linked, matched, cleaned up, and transformed across systems. The need for these activities adds complexity to current data handling methods, which makes it a challenge for us.
Variability – This refers to the consistency of the data flow. In addition to challenges such as the increasing velocity and variety of data, the consistency of the data flow matters a lot. When handling huge volumes of data, you should be ready to face situations in which the data flow is highly inconsistent, with periodic peaks. Peaks may be caused by daily, seasonal, or event-triggered data loads and can prove challenging to manage. With unstructured data, this becomes even more complex and tough to manage.