Using Hadoop to Manage Dark Data


Dark Data is the biggest piece of the pie (Datumize, n.d.) when it comes to Big Data: it is what lies beneath huge datasets of collected information.

IBM has stated in a report that over 80 percent of all data is dark and unstructured, meaning that there is simply too much of it to process, analyse or unlock valuable information from.

This data is termed Dark Data. It is mostly unstructured and is often missing or incomplete, which leaves a lot of potential for solutions built around high-volume analytics and processing.

Hadoop is a highly distributed storage and processing infrastructure that addresses several of the Five V’s that characterise most Big Data platforms.

Of these, Hadoop primarily focuses on Volume, Variety and Velocity (Manikandan and Ravi, 2014): it can handle massive datasets with varying degrees of structure, and it can ingest data at a rate that scales with the size of the cluster.
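
To make the idea of "varying degrees of structure" concrete, the following is a minimal sketch in plain Python, not Hadoop code. It assumes a hypothetical set of raw records and a hypothetical (user, bytes) schema, and shows how a schema can be applied when the data is read rather than when it is stored, so that incomplete or oddly formatted records are tolerated instead of rejected at ingestion.

```python
import json

# Hypothetical raw records as they might land in storage: no shared schema.
RAW_RECORDS = [
    '{"user": "alice", "bytes": 512}',  # well-formed JSON
    '{"user": "bob"}',                  # JSON with a missing field
    'carol,2048',                       # comma-separated, not JSON
    '',                                 # empty line
]

def read_with_schema(line):
    """Apply a (user, bytes) schema at read time; return None if unusable."""
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Fall back to a simple "user,bytes" layout (an assumption).
        parts = line.split(",")
        if len(parts) == 2 and parts[1].isdigit():
            return {"user": parts[0], "bytes": int(parts[1])}
        return None
    if not isinstance(record, dict):
        return None
    # Missing fields become None instead of causing a failure.
    return {"user": record.get("user"), "bytes": record.get("bytes")}

parsed = [r for r in map(read_with_schema, RAW_RECORDS) if r is not None]
for row in parsed:
    print(row)
```

The raw lines are never modified; a different schema could be applied to the same stored data later, which is the property that makes this approach attractive for dark data.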

Some of the advantages of using Hadoop are:

  • Scalable
    A cluster can be scaled out simply by adding more nodes.
  • Automatic indexing
    The location of every block of data is recorded automatically as the data is ingested, so no manual indexing step is needed.
  • Cost effective
    Commodity hardware can be used as nodes.
  • Flexible
    Data of any structure can be stored without having to define a schema up front.
  • Fast
    MapReduce breaks jobs down into smaller, more manageable chunks that are processed in parallel across the nodes.
  • Resilient to failure
    If a node fails, the cluster carries on without it, because the data it held is also replicated on other nodes (Nemschoff, 2013).
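
The way MapReduce breaks a job into smaller chunks can be sketched in a few lines of plain Python. This is an illustrative, single-process simulation and not the Hadoop API: the function names (map_phase, shuffle, reduce_phase) and the sample input are assumptions made for the example. On a real cluster, each chunk would be mapped on a different node and the shuffle would move data across the network.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk is mapped independently, which is what allows parallelism.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical input, already split into two chunks.
chunks = ["dark data is big", "big data is dark data"]
counts = word_count(chunks)
print(counts)
```

Because no map call depends on any other, adding nodes lets more chunks be processed at the same time, which is where the scalability and speed claims above come from.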

While Hadoop offers many benefits and long-term rewards, it is important to note some of its restrictions and disadvantages before a project starts:

  • Security concerns
    Hadoop was not designed for enterprise data and was therefore not built with hardened security, compliance, encryption, policy enablement and risk management in mind (Preimesberger, 2013).
  • Vulnerable by nature
    Traditional data security technologies are built around the concept of securing a single protected database server, while Hadoop relies on a cluster of nodes, each of which needs to be individually secured and kept up to date.
  • Not fit for small data
    Hadoop was designed to perform optimally with large datasets. If only smaller datasets are used, it takes much longer to get set up, and more configuration is needed, to achieve output similar to that of a traditional database management system.
  • Potential stability issues
    As Hadoop and its ecosystem are still being actively developed, they are not as mature or as heavily tested as some older technologies. Bugs are therefore known to arise from time to time.
  • Joining multiple datasets is tricky and slow
    Developers are very fond of joining tables together (Kumar, 2016) in order to avoid duplicating information and to always have access to the latest version of a piece of data. Hadoop favours a flatter storage layer where data is stored as is, and joining large datasets is not recommended, because matching records from both datasets must first be moved across the cluster to the same node before they can be combined.
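
To illustrate why joins are costly in this model, here is a minimal sketch of a reduce-side join simulated in plain Python. It is not Hadoop code, and the function name and the sample users and orders datasets are hypothetical. The point to note is that every record from both inputs has to be grouped by key before any joined row can be produced; on a cluster, that grouping step means shuffling both datasets over the network.

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Inner-join two lists of (key, value) records by key."""
    # "Shuffle": every record from BOTH inputs is tagged and grouped by key.
    grouped = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left:
        grouped[key]["L"].append(value)
    for key, value in right:
        grouped[key]["R"].append(value)

    # "Reduce": pair up left and right values that share a key.
    joined = []
    for key, sides in grouped.items():
        for left_value in sides["L"]:
            for right_value in sides["R"]:
                joined.append((key, left_value, right_value))
    return joined

# Hypothetical datasets keyed by user id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]

result = reduce_side_join(users, orders)
print(result)
```

Storing the user name alongside each order in the first place (a flatter, denormalised layout) would avoid this grouping step altogether, at the price of duplicated data.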

Hadoop is a strong candidate for storing, indexing and analysing massive datasets that would otherwise be unmanageable in traditional database systems.

References

Datumize (n.d.) The Evolution of Dark Data and how you can harness it to make your business Smarter [Online] Datumize.com, Available from: https://datumize.com/evolution-dark-data/ (Accessed on 19th January 2018)

Manikandan, S. G. and Ravi, S. (2014) Big Data Analysis Using Apache Hadoop [Online] IEEE.org, Available from: http://ieeexplore.ieee.org/abstract/document/7021746/?reload=true (Accessed on 19th January 2018)

Nemschoff, M. (2013) Big data: 5 major advantages of Hadoop [Online] ITProPortal.com, Available from: https://www.itproportal.com/2013/12/20/big-data-5-major-advantages-of-hadoop/ (Accessed on 19th January 2018)

Preimesberger, C. (2013) Hadoop Poses a Big Data Security Risk: 10 Reasons Why [Online] EWeek.com, Available from: http://www.eweek.com/security/hadoop-poses-a-big-data-security-risk-10-reasons-why (Accessed on 19th January 2018)

Kumar, P. (2016) Mastering Hadoop – Pros and Cons of Using Hadoop technologies [Online] Naukri.com, Available from: https://learning.naukri.com/articles/hadoop-technology-advantages-and-disadvantages/ (Accessed on 19th January 2018)