Medium 9781601323262

A Prediction-Based Approach for Collective I/O Optimization

Hamid R. Arabnia, Lou D'Alotto, Hiroshi Ishii, Minoru Ito, Kazuki Joe, Hiroaki Nishikawa, Georgios Sirakoulis, William Spataro, Giuseppe A. Trunfio, George A. Gravvanis, George Jandieri, Ashu M. G. Solo, Fernando G. Tinetti CSREA Press PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |



Chaoqun Sha#, Hua Nie#, Huaiming Song*, Chenming Zheng#, Xiaojun Yang*, Chungjin Hu#

# School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, P.R.China

* Dawning Information Industry Co., Ltd., Beijing 100193, P.R.China

Abstract - Computing is becoming increasingly data-centric. Data access through I/O has been identified as a critical bottleneck for the end-to-end performance of high-end computing. In this paper, we propose a lightweight approach that automatically identifies and prevents harmful collective I/O, specifically for MPI_IO_Read. We first give an analytic model of the performance of independent and collective I/O. We then design a mechanism that puts the model to use. Finally, we incorporate this fine-grained mechanism into the ROMIO MPI I/O library for performance testing. Experimental results show that the accuracy of our analytic model reaches about 92%. The proposed model-prediction mechanism is simple and practical, with a complexity of O(1). Analytical and experimental results confirm the practical usability of the proposed collective I/O improvement.
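The abstract does not give the analytic model itself, so the following is only a toy illustration of how a constant-time cost comparison between independent and collective reads might look. Every name, parameter, and formula here (ReadPattern, SystemParams, the latency and bandwidth terms) is a hypothetical assumption for illustration and is not the authors' model or any ROMIO interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPattern:
    """Hypothetical description of one read phase."""
    num_procs: int          # processes taking part in the read
    requests_per_proc: int  # noncontiguous file requests issued by each process
    bytes_per_request: int  # size of each request in bytes


@dataclass(frozen=True)
class SystemParams:
    """Hypothetical machine parameters (assumed to be measured offline)."""
    io_latency_s: float       # fixed cost of one file-system request, seconds
    io_bandwidth_bps: float   # aggregate file-system bandwidth, bytes/second
    net_bandwidth_bps: float  # interconnect bandwidth per aggregator, bytes/second
    num_aggregators: int      # processes that touch the file in collective mode


def total_bytes(p: ReadPattern) -> int:
    return p.num_procs * p.requests_per_proc * p.bytes_per_request


def independent_cost(p: ReadPattern, s: SystemParams) -> float:
    # Assumption: each process pays one latency per request, processes run
    # concurrently, and all of them share the file-system bandwidth.
    return p.requests_per_proc * s.io_latency_s + total_bytes(p) / s.io_bandwidth_bps


def collective_cost(p: ReadPattern, s: SystemParams) -> float:
    # Assumption: aggregators issue one large contiguous read each, then
    # redistribute the data to the other processes over the network.
    io_time = s.io_latency_s + total_bytes(p) / s.io_bandwidth_bps
    shuffle_time = (total_bytes(p) / s.num_aggregators) / s.net_bandwidth_bps
    return io_time + shuffle_time


def use_collective(p: ReadPattern, s: SystemParams) -> bool:
    # A fixed number of arithmetic operations, so the decision is O(1).
    return collective_cost(p, s) < independent_cost(p, s)
```

Under these assumed formulas, a pattern with many small noncontiguous requests favors collective I/O, while a few large contiguous requests favor independent I/O because the redistribution step adds cost without saving any request latency.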


Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost and Hyper-Threading


Yosuke Wakisaka, Naoki Shibata, Keiichi Yasumoto, Minoru Ito

Nara Institute of Science and Technology

Nara, Japan

{yosuke-w, n-sibata, yasumoto, ito}@is.naist.jp

Abstract—In this paper, we propose a task scheduling algorithm for multiprocessor systems with Turbo Boost and Hyper-Threading technologies. The proposed algorithm minimizes the total computation time by taking into account the dynamic changes in processing speed caused by the two technologies, in addition to network contention among the processors. We constructed a clock speed model that estimates how processing speed changes under Turbo Boost and Hyper-Threading for various processor usage patterns. We then constructed a new scheduling algorithm that minimizes the total execution time of a task graph while considering network contention and the two technologies. We evaluated the proposed algorithm through simulations and through experiments on a multiprocessor system consisting of 4 PCs. In the experiment, the proposed algorithm produced a schedule that reduces the total execution time by


Method of Extracting Parallelization in Very Large Applications through Automated Tool and Iterative Manual Intervention


Smitha K.P, Aditi Sahasrabudhe, Vinay Vaidya

Center for Research in Engineering Sciences and Technology (CREST), KPIT Technologies, Pune, India

Abstract

Program parallelization involves multiple considerations. These include methods for data or control parallelization, target architecture, and performance scalability. Because of the number of such factors, the best parallelization strategy for a given sequential application often evolves iteratively. Researchers are confronted with choices among parallelization methods to achieve the best possible performance. In this paper, we share our experience in parallelizing a very large application (250K LOC) on shared memory processors. We iteratively parallelized the application by selectively combining the benefits of automatic and manual parallelization. We used YUCCA, an automatic parallelization tool, to generate parallelized code. Using the information generated by YUCCA, we improved the performance by modifying the parallelized code. This iterative process was continued until no further improvement was possible. We observed a performance improvement of 17%, compared to the 5% improvement reported in the literature. The performance improvement was gained in a very short time and despite the constraint of having to use only SMPs for parallelization.


Multiple Precision Integer Multiplication on GPUs


Koji Kitano and Noriyuki Fujimoto

Graduate School of Science, Osaka Prefecture University, Sakai-shi, Osaka, Japan

Abstract— This paper addresses multiple precision integer multiplication on GPUs. We propose a novel data structure named a product digit table and present a GPU algorithm that performs the multiplication with the product digit table. Experimental results on a 3.10 GHz Intel Core i3-2100 CPU and an NVIDIA GeForce GTX480 GPU show that the proposed GPU algorithm runs over 71.4 times and 12.8 times faster than the NTL and GMP libraries, respectively, two common libraries for single-threaded multiple precision arithmetic on CPUs. Further experiments also show that the proposed GPU algorithm is faster than the fastest existing GPU algorithm based on FFT multiplication when the bit lengths of the two given multiple precision integers differ.

Keywords: multiple precision integer, parallel multiplication,
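The abstract does not describe the layout of the product digit table, so the sketch below shows only the generic schoolbook method that multiple precision multiplication builds on, written sequentially. The base, the digit representation, and the function names (to_digits, from_digits, multiply) are assumptions chosen for illustration and are not the authors' GPU algorithm.

```python
BASE = 1 << 16  # assumed digit base; each digit is stored in a Python int


def to_digits(n: int) -> list[int]:
    """Split a non-negative integer into little-endian base-BASE digits."""
    if n == 0:
        return [0]
    digits = []
    while n:
        digits.append(n % BASE)
        n //= BASE
    return digits


def from_digits(digits: list[int]) -> int:
    """Inverse of to_digits."""
    value = 0
    for d in reversed(digits):
        value = value * BASE + d
    return value


def multiply(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook product of two little-endian digit lists."""
    # Column k collects every partial product a[i] * b[j] with i + j == k.
    columns = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            columns[i + j] += ai * bj
    # Propagate carries from the least significant column upward.
    carry = 0
    for k in range(len(columns)):
        total = columns[k] + carry
        columns[k] = total % BASE
        carry = total // BASE
    # Drop leading zero digits, keeping at least one digit.
    while len(columns) > 1 and columns[-1] == 0:
        columns.pop()
    return columns
```

The column sums can be computed independently of one another before the carry pass, which is the property that makes this formulation a natural starting point for parallel implementations. The operands may have different lengths, as in the unequal bit length case the abstract mentions.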


Study of Dynamically-Allocated Multi-Queue Buffers for NoC Routers


Yung-Chou Tsai

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan

d923935@oz.nthu.edu.tw

Yarsun Hsu

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan

yshsu@ee.nthu.edu.tw

Abstract—A large portion of the area and power in Network-on-Chip (NoC) routers is consumed by buffers, and hence these costly storage resources must be utilized well. However, some earlier proposals are no longer suitable for modern NoC router architectures or for varied, complicated traffic loads. In this work, we refine the dynamically-allocated multi-queue (DAMQ) buffer organization and propose a new one, named DAMQ with multiple packets (DAMQ-MP), that can accommodate more packets than the number of virtual channels. The DAMQ-MP scheme can solve certain data transmission issues under some circumstances, such as heavy network congestion or short packets, to improve performance. We also introduced two methods applicable to
