Hadoop distributed file system mechanism for processing of large datasets across computers cluster using programming techniques

Nicholas Jain Edwards; David Tonny Brain; Stephen Carinna Joly; Mariana Karry Masucato

doi:10.21744/irjmis.v6n6.739

Authors

Nicholas Jain Edwards
iwayansuryasa@gmail.com
University of Westminster, London, United Kingdom
David Tonny Brain University of Westminster, London, United Kingdom
Stephen Carinna Joly SOAS, University of London, London, United Kingdom
Mariana Karry Masucato University College London, London, United Kingdom

Keywords:

file, hadoop, memory, pipeline, system

Abstract

In this paper, we demonstrate that the performance of HDFS I/O operations improves when set associativity is integrated into the cache design and the pipeline topology is changed to a fully connected digraph network topology. In a read operation, a set-associative cache offers far more candidate locations (words) than direct mapping does, so the miss ratio is very low; this reduces the swapping of data between main memory and cache memory and thereby improves memory I/O performance. In a write operation, instead of using the sequential pipeline, we construct a fully connected graph from the data blocks listed in the NameNode metadata. In the sequential pipeline, the data is first copied to the source node of the pipeline. The source node then copies the data to the next data block in the pipeline, and the same copy process continues until the last data block is reached. The acknowledgments then travel back along the same path, from the last block to the source block. With a replication factor of n, the time required to transfer the data to all the data blocks in the pipeline and to complete the acknowledgment process is therefore almost 2n times the time needed to copy the data from one data block to another.
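The two effects described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the LRU replacement policy, the toy address trace, and the assumption that one acknowledgment hop costs about as much as one copy hop are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


def miss_ratio(addresses, num_lines, ways):
    """Miss ratio of a cache with `num_lines` lines grouped into sets of `ways` lines.

    ways == 1 models direct mapping; ways > 1 models a set-associative cache.
    LRU replacement within a set is an assumption of this sketch.
    """
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        cache_set = sets[addr % num_sets]
        if addr in cache_set:
            cache_set.move_to_end(addr)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)  # evict the least recently used entry
            cache_set[addr] = True
    return misses / len(addresses)


def sequential_pipeline_time(n, copy_time, ack_time=None):
    """Time for n hop-by-hop copies followed by n hop-by-hop acknowledgments."""
    if ack_time is None:
        ack_time = copy_time  # assumption: an ack hop costs about one copy hop
    return n * copy_time + n * ack_time


# Hypothetical trace: two addresses that collide under direct mapping.
trace = [0, 8] * 50
direct = miss_ratio(trace, num_lines=8, ways=1)
two_way = miss_ratio(trace, num_lines=8, ways=2)
print(direct, two_way)

# Replication factor n = 3, one time unit per copy: roughly 2n units in total.
print(sequential_pipeline_time(n=3, copy_time=1.0))
```

The trace is deliberately built to provoke conflict misses, so the gap between the two miss ratios is larger than a realistic workload would show. The pipeline function only restates the 2n estimate for the sequential case; the abstract gives no timing for the fully connected topology, so none is modeled here.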
