Friday, January 29, 2016

Hadoop FAQ's

1)What are the Hadoop configuration files at present?

2)What does the jps command do?

3)What is the default replication factor in Hadoop?

4)What is the default block size in Hadoop?

5)What does the replication factor mean?

6)What is ZooKeeper?

7)What are 'maps' and 'reduces'?

8)What are the four basic parameters of a mapper?

9)What are the four basic parameters of a reducer?

10)What does a MapReduce partitioner do?

11)What is the difference between the NameNode and the Secondary NameNode?

12)What happens when a datanode fails?

13)Difference between ORDER BY and SORT BY in Hive?

14)What is a bag?

15)What does FOREACH do?

16)What does COGROUP do in Pig?

17)What are the types of tables in Hive?

18)Pig Latin vs MapReduce

19)How do you load data from the local file system into Hadoop (HDFS)?

20)Write a simple map program (see the sketch after this list)

21)What are the parameters of

22)Does ILLUSTRATE run an MR job?

23) What is speculative execution?

24)Difference between HBase and HDFS?

25)   Hadoop can run in how many modes? What are they?
26)   What are the Hadoop configuration files at present?
27)   What does the jps command do?
28)   What is SSH?
29)   What is the default replication factor in Hadoop?
30)   What is the default block size in Hadoop?
31)   What does the replication factor mean?
32)   What is ZooKeeper?
33)   What is Thrift in HDFS?
34)   What are 'maps' and 'reduces'?
35)   What is the default format for Hadoop?
36)   What are the four basic parameters of a mapper?
37)   What are the four basic parameters of a reducer?
38)   What do the master class and the output class do?
39)   What is the input type/format in MapReduce by default?
41)   What does a MapReduce partitioner do?
42)   What is Streaming?
43)   What is the difference between the NameNode and the Secondary NameNode?
44)   Which operating system(s) are supported for production Hadoop deployment?
45)   Explain what conf.setMapperClass does?
46)   What is speculative execution?
47)   If you have a file of 128M size and the replication factor is set to 3, how many blocks can you find on the cluster that will correspond to that file (assuming the default Apache and Cloudera configuration)?
48)   What is rack awareness?
49)   What happens when a datanode fails?
50)   What is fault tolerance in HDFS?
51)   What is the disk block size in HDFS?
52)   Does ACID support work in Hive?
53)   Difference between Hive 0.13 and Hive 0.14?
54)   Is ZooKeeper required to run update and delete commands?
55)   From which version is ACID support fully implemented?
56)   Does Hive support record-level insert, delete, or update?
57)   What kind of data warehouse application is suitable for Hive?
58)   TRIM function in Hive with example?
59)   REVERSE function in Hive with example?
60)   How to change a column data type in Hive?
61)   Difference between ORDER BY and SORT BY in Hive?
62)   What are the types of tables in Hive?
63)   What is a bag?
64)   What does FOREACH do?
65)   What does COGROUP do in Pig?
66)   Does ILLUSTRATE run an MR job?
67)   Is the keyword FUNCTIONAL a user-defined function?
68)   What is a cell in HBase?
69)   What is HMaster?
70)   In how many ways can HBase be installed?
71)   Does HBase support ACID properties?
72)   Difference between HBase and HDFS?
73)   In HBase, what are column families?
74)   What is the role of the Master server in HBase?
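
For question 20 above, here is a minimal sketch of a simple map program, assuming the new org.apache.hadoop.mapreduce API; the class name WordCountMapper is illustrative only. It emits (word, 1) for every word of every input line, i.e. the classic word-count mapper:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simple map program: tokenize each input line and emit (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // one output pair per word
        }
    }
}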



Descriptive:


23. What is a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
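
As a rough illustration of option D (assuming the Hadoop 2 SequenceFile API; the output path demo.seq is hypothetical), the sketch below writes key-value pairs where every key is an IntWritable and every value is a Text:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("demo.seq");   // hypothetical output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // all keys share one type and all values share one type
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}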

24. Is there a map input format?
A.  Yes, but only in Hadoop 0.22+.
B.  Yes, there is a special format for map files.
C.  No, but sequence file input format can read map files.
D.  Both 2 and 3 are correct answers.

25. In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplitable to always return false.
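
A minimal sketch of option D, assuming the new-API TextInputFormat from org.apache.hadoop.mapreduce.lib.input (the class name WholeFileTextInputFormat is hypothetical): with isSplitable returning false, each input file becomes exactly one split and therefore one map task.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one file -> one split -> one mapper
    }
}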

26. Which of the following best describes the workings of TextInputFormat?
A. Input file splits may cross line breaks. A line that crosses file splits is ignored.
B. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
C. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
D. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.

27. Which of the following statements most accurately describes the relationship between MapReduce and Pig?
A. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.
B. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
C. Pig programs rely on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.
D. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.

28. You need to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with your imported data. Which of the following tools should you use to accomplish this?
A. Pig
B. Hue
C. Hive
D. Flume
E. Sqoop
F. Oozie
G. fuse-dfs

29. The Combine stage, if present, must perform the same aggregation operation as Reduce.
A. True
B. False 

30. How can a distributed filesystem such as HDFS provide opportunities for optimization of a MapReduce operation?
A. Data represented in a distributed filesystem is already sorted.
B. Distributed filesystems must always be resident in memory, which is much faster than disk.
C. Data storage and processing can be co-located on the same node, so that most input data relevant to Map or Reduce will be present on local disks or cache.
D. A distributed filesystem makes random access faster because of the presence of a dedicated node serving file metadata. 

31. What is AVRO?
A.  Avro is a Java serialization library 
B.  Avro is a Java compression library 
C.  Avro is a Java library that creates splittable files 
D.  None of these answers are correct

32. What is the distributed cache?
A.  The distributed cache is special component on namenode that will cache frequently used data for faster client response. It is used during reduce step. 
B.  The distributed cache is special component on datanode that will cache frequently used data for faster client response. It is used during map step. 
C.  The distributed cache is a component that caches java objects. 
D.  The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.
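
A rough sketch of answer D, assuming the Hadoop 2 Job API (the paths /shared/lookup.txt and /shared/extra-lib.jar are hypothetical): files and jars registered on the distributed cache are copied to every task node before the tasks start.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
        // ship a small side-data file to every task node
        job.addCacheFile(new URI("/shared/lookup.txt"));
        // ship an extra jar and add it to the task classpath
        job.addFileToClassPath(new Path("/shared/extra-lib.jar"));
        // inside a Mapper/Reducer the cached files can be listed with
        // context.getCacheFiles() and read as local files
    }
}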

33. Will settings made using the Java API override values in configuration files?
A.  No. The configuration settings in the configuration file takes precedence 
B.  Yes. The configuration settings using Java API take precedence 
C.  It depends when the developer reads the configuration file. If it is read first then no. 
D.  Only global configuration settings are captured in configuration files on namenode. There are only a very few job parameters that can be set using Java API.
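
A small sketch of answer B (assuming the usual *-site.xml files are on the classpath): whatever is set through the Java API wins for that Configuration object.

import org.apache.hadoop.conf.Configuration;

public class ConfPrecedenceDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads the default/site xml files
        conf.set("mapreduce.job.reduces", "5");     // API setting overrides the file value
        System.out.println(conf.get("mapreduce.job.reduces"));   // prints 5
    }
}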

34. How can you disable the reduce step?
A.  The Hadoop administrator has to set the number of reducer slots to zero on all slave nodes. This will disable the reduce step. 
B.  It is impossible to disable the reduce step since it is a critical part of the Map-Reduce abstraction. 
C.  A developer can always set the number of reducers to zero. That will completely disable the reduce step. 
D.  While you cannot completely disable reducers you can set output to one. There needs to be at least one reduce step in the Map-Reduce abstraction.
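
A driver-side sketch of answer C, assuming the Hadoop 2 Job API (the class name MapOnlyDriver is hypothetical and input/output paths are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-job");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(Mapper.class);   // the plain Mapper acts as an identity mapper
        job.setNumReduceTasks(0);           // zero reducers: the shuffle/sort and reduce
                                            // step are skipped and map output goes
                                            // straight to HDFS
    }
}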

35. How can you use binary data in MapReduce?
A.  Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file. 
B.  Binary data cannot be used by the Hadoop framework. Binary data should be converted to a Hadoop-compatible format prior to loading. 
C.  Binary data can be used in map-reduce only with very limited functionality. It cannot be used as a key, for example. 
D.  Hadoop can freely use binary files with map-reduce jobs so long as the files have headers

36. How many methods does the Writable interface define?
A. Two
B. Four
C. Three
D. None of the above

37. Which of the following scenarios makes HDFS unavailable?
A. JobTracker failure
B. TaskTracker failure
C. DataNode failure
D. NameNode failure
E. Secondary NameNode failure

38. When archiving Hadoop files, which of the following statements are true?
A. Archived files will display with the extension .arc
B. Many small files will become fewer large files
C. MapReduce processes the original file names even after files are archived
D. Archived files must be unarchived for HDFS and MapReduce to access the original, small files
E. Archive is intended for files that need to be saved but no longer accessed by HDFS

39. What is the data type of the return value of the getPartition method in the org.apache.hadoop.mapred.Partitioner interface?
A. InputSplit
B. int
C. String
D. Partition
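
The question refers to the old org.apache.hadoop.mapred.Partitioner interface; the sketch below uses the equivalent new-API class (the name FirstCharPartitioner is hypothetical). Either way, getPartition returns an int: the index of the reduce partition the record is sent to.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // route keys to reducers by the first character of the key
        if (key.getLength() == 0) {
            return 0;
        }
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}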

40. What happens when the io.sort.spill.percent threshold is exceeded when a Mapper is outputting <key,value> pairs?
A. An exception is thrown and the map task is marked as failed.
B. The JobTracker automatically increases the number of reducers.
C. The buffered records are spilled to disk
D. Records output after this threshold is reached are no longer sorted.

41. What happens if mapper output does not match reducer input? 
A.  Hadoop API will convert the data to the type that is needed by the reducer. 
B.  Data input/output inconsistency cannot occur. A preliminary validation check is executed prior to the full execution of the job to ensure there is consistency. 
C.  The java compiler will report an error during compilation but the job will complete with exceptions. 
D.  A real-time exception will be thrown and map-reduce job will fail.

42. You are running a Hadoop cluster with all monitoring facilities properly configured. Which scenario will go undetected?
A. Map or reduce tasks that are stuck in an infinite loop.
B. HDFS is almost full.
C. The NameNode goes down.
D. A DataNode is disconnected from the cluster.
E. MapReduce jobs that are causing excessive memory swaps.

43. Which of the following utilities allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming

44. What is a map-side join?
A.  Map-side join is done in the map phase and done in memory 
B.  Map-side join is a technique in which data is eliminated at the map step 
C.  Map-side join is a form of map-reduce API which joins data from different locations 
D.  None of these answers are correct

45. What are the common problems with map-side joins?
A.  The most common problem with map-side joins is introducing a high level of code complexity. This complexity has several downsides: increased risk of bugs and performance degradation. Developers are cautioned to rarely use map-side joins. 
B.  The most common problem with map-side joins is lack of available map slots, since map-side joins require a lot of mappers. 
C.  The most common problems with map-side joins are out of memory exceptions on slave nodes. 
D.  The most common problem with map-side join is not clearly specifying primary index in the join. This can lead to very slow performance on large datasets.

46. The Hadoop API uses basic Java types such as LongWritable, Text, IntWritable. They have almost the same features as the default Java classes. What are these writable data types optimized for?
A.  Writable data types are specifically optimized for network transmissions 
B.  Writable data types are specifically optimized for file system storage 
C.  Writable data types are specifically optimized for map-reduce processing 
D.  Writable data types are specifically optimized for data retrieval

47. You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing software built on top of Apache Hadoop that defines a simple SQL-like query language well-suited for this kind of user?
A. Pig
B. Hue
C. Hive
D. Sqoop
E. Oozie
F. Flume
G. Hadoop Streaming

48. You need a distributed, scalable data store that allows random, real-time read/write access to hundreds of terabytes of data. Which of the following would you use?
A. Hue
B. Pig
C. Hive
D. Oozie
E. HBase
F. Flume
G. Sqoop

49. What is a reduce-side join?
A.  Reduce-side join is a technique to eliminate data from initial data set at reduce step 
B.  Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions 
C.  Reduce-side join is a set of API to merge data from different sources. 
D.  None of these answers are correct

50. Which is faster: Map-side join or Reduce-side join? Why?
A.  Both techniques have about the same performance expectations. 
B.  Reduce-side join, because the join operation is done on HDFS. 
C.  Map-side join is faster because the join operation is done in memory. 
D.  Reduce-side join, because it is executed on the namenode, which will have a faster CPU and more memory.

MapReduce Job Execution

Job Submission > Job Initialization > Task Assignment > Task Execution

JobSubmitter is started (a driver-side sketch follows this list)
1. It checks whether the output directory already exists.
2. It checks whether the input files exist.
3. It creates a new job ID (obtained from the JobTracker).
4. It creates a directory named after that job ID; this directory is created on the JobTracker machine.
5. It uploads the job-related jar/xml files into this directory.
6. It creates multiple copies of the job.jar file and places them in different locations of HDFS (the shared directory), typically with a replication factor of 10.
7. It computes the number of input splits from the block information supplied by the NameNode.
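
The steps above happen behind an ordinary driver-side submission. A minimal driver sketch (assuming the Hadoop 2 API, the WordCountMapper sketched earlier in this post, and the built-in IntSumReducer; input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);       // jar that gets shipped to the cluster
        job.setMapperClass(WordCountMapper.class);      // mapper sketched earlier in this post
        job.setReducerClass(IntSumReducer.class);       // built-in summing reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input must exist
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output dir must not exist
        // waitForCompletion() triggers the JobSubmitter checks listed above
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}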

Job Initialization
1. A job is essentially a set of map tasks and reduce tasks.
2. The JT creates four kinds of tasks:
3. Map tasks: number of map tasks = number of splits
4. Reduce tasks: number of reduce tasks is configurable, e.g. setNumReduceTasks(3)
5. Setup task
6. Cleanup task
7. The jobs are placed in a job queue.
8. These jobs are consumed by schedulers.
9. Schedulers: FIFO Scheduler, Fair Scheduler, Capacity Scheduler

Task Assignment
1. Now the scheduler picks a job and hands it to the JobTracker.
2. Now the JT has to know which TaskTrackers (TTs) are available.
3. It is the TT's responsibility to tell the JT "I am available", via the heartbeat.
4. In the heartbeat the TT also sends one more piece of info: whether it is ready to take a new task or not.
5. Based on this info the JT assigns a new task to the TaskTracker.
6. The TT knows that a new task has been assigned from the return value of the heartbeat.

Task Execution
1. Setup task: it creates a tmp directory local to the TT and pulls the jar files from the shared directory of the JobTracker.
2. The setup task creates OutputCommitters.
3. Based on the OutputCommitter, the map task knows where its output should go after it finishes executing.
4. The reduce task likewise uses the OutputCommitter to determine where its output should go.
5. The map task starts execution and produces output, which is handled by the OutputCommitter.
6. The reduce task starts execution and produces output, which is handled by the OutputCommitter.
7. The cleanup task runs and removes the tmp directory; any local data is also removed.


Task Execution:

Map task / reduce task: the TT unzips the jars and copies the files into memory. It then starts executing the map task or reduce task, and it can also read local files (from the JT's shared directory).

The JT pulls progress info from the map/reduce tasks (reported through TT heartbeats), and the client retrieves this information from the JT.
The JT also copies this progress info into the shared directory.

Friday, December 13, 2013

Challenges Of Big Data

Unstructured information is growing.

Budget:
Managing big data can take huge resources.

Data complications:
Lack of data mobility and proprietary formats can make working with big data complicated.

Expensive:
ETL processes for big data can be expensive, as there is no specialized software.