{
"iq":[{"id":"1",
"q":"Explain what is Hadoop?",
"answer":"It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides enormous processing power and massive storage for any type of data. "
},
{"id":"2",
"q":"What is the difference between an RDBMS and Hadoop?",
"answer":"RDBMS\n\nAn RDBMS is a relational database management system.\n\nIt is used for OLTP processing.\n\nIn an RDBMS, the database cluster uses the same data files stored in shared storage.\n\nYou need to preprocess data before storing it.\n\nHadoop\n\nHadoop has a node-based flat structure.\n\nIt is used for analytical and Big Data processing.\n\nIn Hadoop, data can be stored independently on each processing node.\n\nYou don’t need to preprocess data before storing it."
},
{"id":"3",
"q":"Mention Hadoop core components?",
"answer":"Hadoop core components include:\n\nHDFS\n\nMapReduce"
},
{"id":"4",
"q":"Name some companies that use Hadoop.",
"answer":"Yahoo (one of the biggest users and contributor of more than 80% of the Hadoop code)\n\nFacebook\n\nNetflix\n\nAmazon\n\nAdobe\n\neBay\n\nHulu\n\nSpotify\n\nRubikloud\n\nTwitter"
},
{"id":"5",
"q":"What do the four V’s of Big Data denote?",
"answer":"IBM has a nice, simple explanation for the four critical features of big data:\n\na) Volume – Scale of data\n\nb) Velocity – Analysis of streaming data\n\nc) Variety – Different forms of data\n\nd) Veracity – Uncertainty of data"
},
{"id":"6",
"q":"Differentiate between Structured and Unstructured data.",
"answer":"Data which can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, is referred to as structured data.\n\nData which can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data.\n\nUnorganized and raw data that cannot be categorized as structured or semi-structured is referred to as unstructured data.\n\nFacebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data."
},
{"id":"7",
"q":"What is Hadoop streaming?",
"answer":"The Hadoop distribution provides a generic application programming interface for writing map and reduce jobs in any desired programming language, such as Python, Perl, or Ruby. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer."
},
{"id":"8",
"q":"What are the Features of Hadoop?",
"answer":"Open Source\n\nDistributed processing\n\nFault Tolerance\n\nReliability\n\nHigh Availability\n\nScalability\n\nEconomic\n\nEasy to use "
},
{"id":"9",
"q":"What are the modes in which Hadoop run?",
"answer":"Apache Hadoop runs in three modes:\n\nLocal (Standalone) Mode – By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. It is also used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required in the configuration files.\n\nPseudo-Distributed Mode – Just like Standalone mode, Hadoop in Pseudo-distributed mode also runs on a single node. The difference is that each daemon runs in a separate Java process in this mode. In Pseudo-distributed mode, we need to configure the four main configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml). All daemons run on one node, and thus the Master and Slave node are the same.\n\nFully-Distributed Mode – In this mode, the daemons execute on separate nodes, forming a multi-node cluster. Thus, it allows separate nodes for Master and Slave."
},
{"id":"10",
"q":"What is NameNode in Hadoop?",
"answer":"The NameNode in Hadoop is where Hadoop stores all the file location information for HDFS. It is the master node on which the JobTracker runs, and it holds the metadata."
},
{"id":"11",
"q":"Mention what are the data components used by Hadoop?",
"answer":"Data components used by Hadoop are\n\nPig\n\nHive"
},
{"id":"12",
"q":"Mention what is the data storage component used by Hadoop?",
"answer":"The data storage component used by Hadoop is HBase."
},
{"id":"13",
"q":"In Hadoop what is InputSplit?",
"answer":"InputSplit splits input files into chunks and assigns each split to a mapper for processing."
},
{"id":"14",
"q":"For a job in Hadoop, is it possible to change the number of mappers to be created?",
"answer":"No, it is not possible to change the number of mappers to be created. The number of mappers is determined by the number of input splits."
},
{"id":"15",
"q":"What are the limitations of Hadoop? ",
"answer":"Issues with small files\n\nProcessing speed\n\nSupport for batch processing only\n\nNo iterative processing\n\nVulnerable by nature\n\nSecurity"
},
{"id":"16",
"q":"What is the problem with small files in Hadoop? ",
"answer":"Hadoop is not suited for small data, and HDFS lacks the ability to support efficient random reading of small files.\n\nA small file in HDFS is a file smaller than the HDFS block size (default 128 MB).\n\nHDFS works best with a small number of large files for storing large datasets; it is not suitable for a large number of small files.\n\nA large number of small files overloads the NameNode, since the NameNode stores the namespace of HDFS in memory."
},
{"id":"17",
"q":"How is security achieved in Hadoop?",
"answer":"Apache Hadoop achieves security by using Kerberos.\n\nAt a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server:\n\nAuthentication – The client authenticates itself to the authentication server and receives a timestamped Ticket-Granting Ticket (TGT).\n\nAuthorization – The client uses the TGT to request a service ticket from the Ticket-Granting Server.\n\nService Request – The client uses the service ticket to authenticate itself to the server."
},
{"id":"18",
"q":"Why does one remove or add nodes in a Hadoop cluster frequently? ",
"answer":"One of the most important features of Hadoop is its use of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster.\n\nAnother striking feature of Hadoop is its ease of scaling with rapid growth in data volume. Due to these two reasons, administrators frequently add and remove DataNodes in a Hadoop cluster."
},
{"id":"19",
"q":"What does jps command do in Hadoop? ",
"answer":"The jps command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons running on the machine, such as the NameNode, DataNode, ResourceManager, and NodeManager."
},
{"id":"20",
"q":"How to restart NameNode or all the daemons in Hadoop?",
"answer":"You can stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, then start it again using /sbin/hadoop-daemon.sh start namenode.\n\nUse /sbin/stop-all.sh and then /sbin/start-all.sh to stop all the daemons first and then start them all again.\n\nThe sbin directory inside the Hadoop directory stores these script files."
},
{"id":"21",
"q":"How to debug Hadoop code?",
"answer":"First, check the list of MapReduce jobs currently running. Then, check whether any orphaned jobs are running; if yes, you need to determine the location of the ResourceManager (RM) logs.\n\nRun “ps -ef | grep -i ResourceManager” and look for the log directory in the displayed result. Find the job ID from the displayed list and check whether there is an error message associated with that job.\n\nOn the basis of the RM logs, identify the worker node that was involved in the execution of the task.\n\nLog in to that node and run “ps -ef | grep -i NodeManager”.\n\nExamine the NodeManager log.\n\nThe majority of errors come from the user-level logs for each map-reduce job."
},
{"id":"22",
"q":"Explain what is a sequence file in Hadoop? ",
"answer":"A sequence file is used to store binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed."
},
{"id":"23",
"q":"When Namenode is down what happens to job tracker?",
"answer":"The NameNode is the single point of failure in HDFS, so when the NameNode is down, the cluster becomes unavailable."
},{"id":"24",
"q":"Explain is it possible to search for files using wildcards? ",
"answer":"Yes, it is possible to search for files using wildcards."
},{"id":"25",
"q":"Explain what is “map” and what is \"reducer\" in Hadoop?",
"answer":"In Hadoop, a map is a phase of MapReduce job processing. A map reads data from an input location and outputs key-value pairs according to the input type.\n\nIn Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own."
},{"id":"26",
"q":"In Hadoop, which file controls reporting in Hadoop?",
"answer":"In Hadoop, the hadoop-metrics.properties file controls reporting."
},{"id":"27",
"q":"Mention what is rack awareness?",
"answer":"Rack awareness is the way in which the NameNode decides how to place blocks based on the rack definitions."
},{"id":"28",
"q":"Explain what is a Task Tracker in Hadoop? ",
"answer":"A TaskTracker in Hadoop is a slave-node daemon in the cluster that accepts tasks from a JobTracker. It also sends periodic heartbeat messages to the JobTracker to confirm that it is still alive."
},{"id":"29",
"q":"Mention what daemons run on a master node and slave nodes?",
"answer":"The daemon that runs on the master node is the \"NameNode\". The daemons that run on each slave node are the \"TaskTracker\" and the \"DataNode\"."
},{"id":"30",
"q":"Explain how can you debug Hadoop code? ",
"answer":"The popular methods for debugging Hadoop code are:\n\nBy using the web interface provided by the Hadoop framework\n\nBy using counters"
},{"id":"31",
"q":"What is the purpose of “RecordReader” in Hadoop? ",
"answer":"The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “InputFormat”."
},{"id":"32",
"q":"How do “reducers” communicate with each other? ",
"answer":"This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation. "
},{"id":"33",
"q":"How will you write a custom partitioner?",
"answer":"A custom partitioner for a Hadoop job can be written by following these steps:\n\nCreate a new class that extends the Partitioner class.\n\nOverride the getPartition method in the wrapper that runs in MapReduce.\n\nAdd the custom partitioner to the job by using the setPartitioner method, or add the custom partitioner to the job as a config file."
},{"id":"34",
"q":"What is a “Combiner”?",
"answer":"A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”. "
},{"id":"35",
"q":"What are the different relational operations in “Pig Latin” you worked with?",
"answer":"FOREACH\n\nORDER BY\n\nFILTER\n\nGROUP\n\nDISTINCT\n\nJOIN\n\nLIMIT"
},{"id":"36",
"q":"What is a UDF?",
"answer":"If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) to bring in that functionality using other languages like Java, Python, or Ruby, and embed them in the script file."
},{"id":"37",
"q":"Explain Hadoop Archives?",
"answer":"Apache Hadoop HDFS stores and processes large (terabyte-scale) data sets. However, storing a large number of small files in HDFS is inefficient, since each file is stored in a block, and block metadata is held in memory by the NameNode.\n\nReading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, all of which is an inefficient data access pattern.\n\nHadoop Archive (HAR) deals with the small files issue. A HAR packs a number of small files into a large file, so one can access the original files in parallel, transparently (without expanding the files), and efficiently.\n\nHadoop Archives are special-format archives. A Hadoop Archive maps to a file system directory and always has a *.har extension. In particular, Hadoop MapReduce can use Hadoop Archives as input."
},{"id":"38",
"q":"How would you check whether your NameNode is working or not?",
"answer":"There are several ways to check the status of the NameNode. Mostly, one uses the jps command to check the status of all the daemons running in HDFS."
},{"id":"39",
"q":"What is the key- value pair in MapReduce? ",
"answer":"Hadoop MapReduce implements a data model which represents data as key-value pairs. Both the input and the output of the MapReduce framework must be in the form of key-value pairs.\n\nIn Hadoop, if the schema is static, we can work directly on the columns instead of keys and values. But if the schema is not static, we work on keys and values. Keys and values are not intrinsic properties of the data; rather, the user analyzing the data chooses the key-value pair. A key-value pair in Hadoop MapReduce is generated in the following way:\n\nInputSplit – It is the logical representation of data. An InputSplit represents the data that an individual mapper will process.\n\nRecordReader – It communicates with the InputSplit (created by the InputFormat) and converts the split into records. Records are in the form of key-value pairs that are suitable for reading by the mapper. By default, the RecordReader uses TextInputFormat for converting data into key-value pairs.\n\nKey – It is the byte offset of the beginning of the line within the file, so it is unique when combined with the file name.\n\nValue – It is the contents of the line, excluding line terminators. For example, if the file content is “on the top of the crumpetty Tree”:\n\nKey – 0\n\nValue – on the top of the crumpetty Tree"
},{"id":"40",
"q":"Mention what is the use of Context Object? ",
"answer":"The Context object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output."
},{"id":"41",
"q":"Mention what is the next step after Mapper or MapTask? ",
"answer":"The next step after the Mapper or MapTask is that the output of the Mapper is sorted, and partitions are created for the output."
},{"id":"42",
"q":"Mention what is the default partitioner in Hadoop? ",
"answer":"In Hadoop, the default partitioner is a “Hash” Partitioner."
},{"id":"43",
"q":"What happens to a NameNode that has no data?",
"answer":"There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it."
},{"id":"44",
"q":"What happens when a user submits a Hadoop job when the NameNode is down: does the job get put on hold, or does it fail?",
"answer":"The Hadoop job fails when the NameNode is down."
},{"id":"45",
"q":"Whenever a client submits a hadoop job, who receives it?",
"answer":"The NameNode receives the Hadoop job; it then looks for the data requested by the client and provides the block information. The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion."
},{"id":"46",
"q":"What are the different operational commands in HBase at record level and table level?",
"answer":"Record-level operational commands in HBase are: put, get, increment, scan, and delete.\n\nTable-level operational commands in HBase are: describe, list, drop, disable, and scan."
},{"id":"47",
"q":"What is Row Key?",
"answer":"Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKeys are co-located on the same server. RowKey is internally regarded as a byte array."
},{"id":"48",
"q":"Explain about the different catalog tables in HBase? ",
"answer":"The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system."
},{"id":"49",
"q":"Explain about HLog and WAL in HBase. ",
"answer":"All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits of all regions performed by that particular region server. WAL stands for Write Ahead Log, the log to which all the HLog edits are written immediately. In the case of deferred log flush, WAL edits remain in memory until the flush period."
},{"id":"50",
"q":"Mention what is distributed cache in Hadoop? ",
"answer":"Distributed cache in Hadoop is a facility provided by the MapReduce framework. It is used to cache files at the time of execution of the job. The framework copies the necessary files to the slave node before the execution of any task at that node."
}
]
}