Big data refers to a volume of data that cannot be stored and processed within a given time frame by a traditional file system. The next question that comes to mind is how big data needs to be in order to be classified as big data. There is a lot of misconception around the term big data. We usually call data "big" if its size is in gigabytes, terabytes, petabytes, exabytes or anything larger. This alone does not define big data completely; even a small file can be referred to as big data depending on the context in which it is used.

Let us take an example to make this clear. If we try to attach a 100 MB file to an email, we cannot do so, because email does not support an attachment of this size. Therefore, with respect to email, this 100 MB file can be referred to as big data. Similarly, if we want to process 1 TB of data in a given time frame, we cannot do it with a traditional system, since its resources are not sufficient to accomplish the task.

As you are aware, social sites such as Facebook, Twitter, Google+, LinkedIn and YouTube contain huge amounts of data, and as the number of users on these sites grows, storing and processing this enormous data becomes a challenging task. Storing this data is important for various firms in order to generate revenue, which is not possible with a traditional file system. This is where Hadoop comes into existence.

Big data simply means a huge amount of structured, unstructured and semi-structured data that can be processed for information. Nowadays, massive amounts of data are produced because of growth in technology, digitalization and a variety of sources, including business application transactions, videos, pictures, electronic mails, social media and so on. The concept of big data was introduced to process such data.

Structured data: data that has a proper format associated with it is known as structured data.
For example, the data stored in database files or in Excel sheets.

Semi-structured data: data that does not have a strict format associated with it is known as semi-structured data. For example, the data stored in mail files or in .docx files.

Unstructured data: data that does not have any format associated with it is known as unstructured data. For example, image files, audio files and video files.

Big data has 3 V's associated with it, which are as follows [1]:
Volume: it is the amount of data being generated, i.e., data in huge quantity.
Velocity: it is the speed at which the data is being generated.
Variety: it refers to the different kinds of data being generated.

A. Challenges Faced by Big Data
There are two main challenges faced by big data [2]:
i. How to store and manage a huge volume of data efficiently.
ii. How to process and extract valuable information from a huge volume of data within a given time frame.
These two main challenges led to the development of the Hadoop framework.

Hadoop is an open source framework developed by Doug Cutting in 2006 and managed by the Apache Software Foundation. Hadoop was named after a yellow toy elephant.

Hadoop was designed to store and process data efficiently. The Hadoop framework comprises two main components:
i. HDFS: it stands for Hadoop Distributed File System and takes care of the storage of data within a Hadoop cluster.
ii. MapReduce: it takes care of the processing of the data that is present in HDFS.

Now let us have a look at a Hadoop cluster. It contains two kinds of nodes: the master node and the slave node. The master node is responsible for running the name node and job tracker daemons. Here "node" is the technical term used to denote a machine present in the cluster, and "daemon" is the technical term for a background process running on a Linux machine. The slave node, on the other hand, is responsible for running the data node and task tracker daemons. The name node and data node store and manage the data and are commonly referred to as storage nodes.
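The storage side of this layout can be sketched in a few lines. The following is a simplified, hypothetical illustration of how a name node might split a file into fixed-size blocks and assign replicas to data nodes; the 128 MB block size, the node names and the round-robin placement rule are assumptions made for the example, not actual Hadoop code.

```python
# Simplified sketch of HDFS-style block storage: split a file into
# fixed-size blocks and place each block's replicas on distinct data
# nodes. Illustrative only; not real HDFS code.

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (a common HDFS default)
REPLICATION = 3                 # replication factor (Hadoop's default is 3)

DATA_NODES = ["node1", "node2", "node3", "node4"]  # hypothetical slave nodes

def split_into_blocks(file_size):
    """Number of fixed-size blocks needed to store file_size bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division

def place_blocks(file_size):
    """Assign each block to REPLICATION distinct data nodes (round-robin)."""
    placement = {}
    for block in range(split_into_blocks(file_size)):
        placement[block] = [DATA_NODES[(block + r) % len(DATA_NODES)]
                            for r in range(REPLICATION)]
    return placement

print(split_into_blocks(1024 ** 4))  # a 1 TB file needs 8192 blocks of 128 MB
print(place_blocks(3 * BLOCK_SIZE))  # 3 blocks, each copied to 3 different nodes
```

In real HDFS the name node also applies rack awareness when choosing replica locations (keeping copies in more than one rack, as described in the failover feature below), which this round-robin sketch deliberately ignores.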
The job tracker and task tracker, in contrast, are responsible for processing and computing the data and are commonly known as compute nodes. Normally the name node and job tracker run on a single machine, whereas the data node and task tracker daemons run on the other machines.

B. Features of Hadoop [3]
i. Cost-effective system: it does not require any special hardware. It can simply be implemented on common machines, technically known as commodity hardware.
ii. Large cluster of nodes: a Hadoop system can support a large number of nodes, which provides huge storage and processing capacity.
iii. Parallel processing: a Hadoop cluster provides the ability to access and process data in parallel, which saves a lot of time.
iv. Distributed data: Hadoop takes care of splitting and distributing the data across all nodes within a cluster. It also replicates the data over the entire cluster.
v. Automatic failover management: once AFM is configured on a cluster, the admin need not worry about a failed machine. Hadoop replicates the data: one copy of each block is replicated to a node in the same rack and another to a different rack, and Hadoop takes care of the internetworking between the two racks.
vi. Data locality optimization: this is one of the most powerful features of Hadoop and makes it highly efficient. Instead of moving a huge volume of data across the network to the machine where the code runs, Hadoop sends the code to the machine where the data resides, which saves a lot of bandwidth.
vii. Heterogeneous cluster: nodes can be from different vendors and can run different flavors of operating systems.
viii. Scalability: in Hadoop, adding or removing a machine does not affect the cluster; even adding or removing components of a machine does not.

C. Hadoop Architecture
Hadoop comprises two components:
i. HDFS
ii.
MapReduce

Hadoop splits big data into several chunks and stores the data on several nodes within a cluster, which significantly reduces the processing time. Hadoop also replicates each chunk of data onto other machines present within the cluster. The number of copies replicated depends on the replication factor. By default the replication factor is 3; in this case there are 3 copies of each piece of data on 3 different machines.

Reference:
Mahajan, P., Gaba, G., & Chauhan, N. S. (2016). Big Data Security. IITM Journal of Management and IT, 7(1), 89-94.