Tuesday, February 10, 2015

Ways to import unstructured data into Hadoop

There are multiple ways to import unstructured data into Hadoop, depending on your use case.
  1. Using HDFS shell commands such as put or copyFromLocal to move flat files into HDFS. For details, please see the File System Shell Guide on apache.org. (A Java sketch of the same copy follows the list.)
  2. Using the WebHDFS REST API for application integration, which exposes HDFS file operations over plain HTTP. (See the WebHDFS sketch after the list.)
  3. Using Apache Flume, a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources into a centralized data store such as HDFS. Although Flume has historically been used mostly for log data collection and aggregation, it can be combined with Kafka to form a real-time event processing pipeline. (A sample agent configuration follows the list.)
  4. Using Apache Storm, a general-purpose, distributed event-processing system. Within a topology composed of spouts and bolts, it can ingest event-based unstructured data into Hadoop. (See the topology sketch after the list.)
  5. Using Spark Streaming, which offers another way to ingest real-time unstructured data into HDFS. Its processing model is quite different from Storm's, though: while Storm processes incoming events one at a time, Spark Streaming batches up events that arrive within a short time window before processing them, an approach often called micro-batching (or mini-batching). Spark Streaming of course runs on top of the Spark Core engine, which is claimed to be up to 100x faster than MapReduce in memory and 10x faster on disk. :-) (See the last sketch after the list.)
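
For item 1, here is a minimal Java sketch that does the same copy programmatically through the HDFS FileSystem API. The NameNode address hdfs://namenode:8020 and the file paths are placeholders for your own cluster and data.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsPut {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Placeholder NameNode address; normally picked up from core-site.xml
          conf.set("fs.defaultFS", "hdfs://namenode:8020");
          FileSystem fs = FileSystem.get(conf);
          // Same effect as: hdfs dfs -put /tmp/events.log /data/raw/
          fs.copyFromLocalFile(new Path("/tmp/events.log"),
                               new Path("/data/raw/events.log"));
          fs.close();
      }
  }

On the command line, the equivalent is simply hdfs dfs -put /tmp/events.log /data/raw/.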
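For item 2, here is a sketch of the two-step WebHDFS CREATE call from plain Java, with no Hadoop client libraries required. The host namenode, the default NameNode web port 50070, the user name hadoop, and the paths are assumptions to adapt to your cluster.

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  public class WebHdfsUpload {
      public static void main(String[] args) throws Exception {
          // Step 1: ask the NameNode where to write; it replies with a 307 redirect
          URL nn = new URL("http://namenode:50070/webhdfs/v1/data/raw/events.log"
                  + "?op=CREATE&user.name=hadoop&overwrite=true");
          HttpURLConnection c1 = (HttpURLConnection) nn.openConnection();
          c1.setRequestMethod("PUT");
          c1.setInstanceFollowRedirects(false);
          String dataNodeUrl = c1.getHeaderField("Location");
          c1.disconnect();

          // Step 2: stream the file bytes to the DataNode URL given in the redirect
          HttpURLConnection c2 = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
          c2.setRequestMethod("PUT");
          c2.setDoOutput(true);
          try (OutputStream out = c2.getOutputStream()) {
              out.write(Files.readAllBytes(Paths.get("/tmp/events.log")));
          }
          System.out.println("HTTP " + c2.getResponseCode()); // 201 Created on success
      }
  }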
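For item 3, Flume ingestion is driven by an agent configuration file rather than code. Below is a small example; the agent name a1, the spool directory, and the HDFS path are placeholders. It watches a local directory and lands whatever files appear there in HDFS.

  # Name the source, channel, and sink of agent "a1"
  a1.sources = src1
  a1.channels = ch1
  a1.sinks = sink1

  # Source: pick up files dropped into a local spool directory
  a1.sources.src1.type = spooldir
  a1.sources.src1.spoolDir = /var/log/incoming
  a1.sources.src1.channels = ch1

  # Channel: buffer events in memory between source and sink
  a1.channels.ch1.type = memory
  a1.channels.ch1.capacity = 10000

  # Sink: write the events into HDFS as plain data files
  a1.sinks.sink1.type = hdfs
  a1.sinks.sink1.hdfs.path = hdfs://namenode:8020/data/flume/events
  a1.sinks.sink1.hdfs.fileType = DataStream
  a1.sinks.sink1.channel = ch1

The agent is then started with something like: flume-ng agent --conf conf --conf-file spool-to-hdfs.conf --name a1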
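For item 4, here is a sketch of a topology that wires a spout to the HdfsBolt shipped in the storm-hdfs module. The package names are from Storm 1.x (older 0.9.x releases used the backtype.storm prefix), TestWordSpout merely stands in for your real event spout, and the NameNode URL and output path are placeholders.

  import org.apache.storm.Config;
  import org.apache.storm.StormSubmitter;
  import org.apache.storm.topology.TopologyBuilder;
  import org.apache.storm.testing.TestWordSpout;
  import org.apache.storm.hdfs.bolt.HdfsBolt;
  import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
  import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
  import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
  import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

  public class HdfsIngestTopology {
      public static void main(String[] args) throws Exception {
          // Write tuples to HDFS, syncing every 1000 tuples and rotating files at 5 MB
          HdfsBolt hdfsBolt = new HdfsBolt()
                  .withFsUrl("hdfs://namenode:8020")
                  .withFileNameFormat(new DefaultFileNameFormat().withPath("/data/storm/"))
                  .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("|"))
                  .withRotationPolicy(new FileSizeRotationPolicy(5.0f,
                          FileSizeRotationPolicy.Units.MB))
                  .withSyncPolicy(new CountSyncPolicy(1000));

          TopologyBuilder builder = new TopologyBuilder();
          // TestWordSpout is only a stand-in; replace it with your own event spout
          builder.setSpout("events", new TestWordSpout(), 1);
          builder.setBolt("hdfs", hdfsBolt, 2).shuffleGrouping("events");

          StormSubmitter.submitTopology("hdfs-ingest", new Config(), builder.createTopology());
      }
  }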
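For item 5, here is a sketch of a Spark Streaming job in Java that groups a socket feed into 10-second mini-batches and saves each batch to HDFS. The socket source, batch interval, and HDFS path are placeholders; a Kafka or Flume receiver could take the socket's place.

  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;

  public class StreamToHdfs {
      public static void main(String[] args) throws Exception {
          SparkConf conf = new SparkConf().setAppName("StreamToHdfs").setMaster("local[2]");
          // Group incoming events into 10-second mini-batches
          JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

          // Text lines arriving on a socket stand in for any unstructured event source
          JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

          // Persist every completed batch to HDFS as plain text files
          lines.dstream().saveAsTextFiles("hdfs://namenode:8020/data/stream/events", "txt");

          jssc.start();
          jssc.awaitTermination();
      }
  }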
