Medium 9781601323262

Parallel and Distributed Processing Techniques and Applications

Views: 2995
Ratings: (0)

PDPTA is an international conference that serves researchers, scholars, professionals, students, and academicians who are looking to both foster working relationships and gain access to the latest research results. It is being held jointly (same location and dates) with a number of other research conferences; namely, The 2014 World Congress in Computer Science, Computer Engineering, and Applied Computing (WORLDCOMP'14). The Congress is among the top five largest annual gathering of researchers in computer science, computer engineering and applied computing. We anticipate to have attendees from about 85 countries/territories. The 2014 Congress will be composed of research presentations, keynote lectures, invited presentations, tutorials, panel discussions, and poster presentations.

List price: $99.95

Your Price: $79.96

You Save: 20%

Remix
Remove
 

116 Slices

Format Buy Remix

Session - Parallel Computing and Algorithms + Multi-Core and Energy-Aware Computing

PDF

 

Modeling the Effects on Power and Performance from Memory Interference of Co-located Applications in Multicore Systems

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

3

Modeling the Effects on Power and Performance from Memory Interference of

Co-located Applications in Multicore Systems

Daniel Dauwe1 , Ryan Friese1 ,

Sudeep Pasricha1,2 , Anthony A. Maciejewski1 , Gregory A. Koenig3 , and Howard Jay Siegel1,2

1

Department of Electrical and Computer Engineering

2

Department of Computer Science

Colorado State University

Fort Collins, CO, 80523

3

Oak Ridge National Laboratory

Oak Ridge, TN 37830

Abstract— In this study, we analyze interference trends when corunning multiple applications possessing varying degrees of memory intensity on multi-core processors. We conduct tests with PARSEC benchmark applications and explore energy consumption, execution times, and main memory accesses when interfering applications share last-level cache. We also explore how co-running applications are impacted when the processor frequency is modified using dynamic voltage and frequency scaling (DVFS). A portable and lightweight testing framework is presented and results are shown for experiments conducted on an Intel i7 quad-core system. It is shown that the degree of degradation due to co-location interference on execution time is highly dependent on the types and number of applications co-located on cores that share the last-level cache.

 

Processing NOAA Observation Data over Hybrid Computer Systems for Comparative Climate Change Analysis

PDF

10

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

Processing NOAA Observation Data over Hybrid

Computer Systems for Comparative Climate Change

Analysis

Xuan Shi1,, Dali Wang 2

Department of Geosciences, University of Arkansas, Fayetteville, AR 72701, USA

2

Environmental Science Division, Oak Ridge National Lab Oak Ridge TN 37831, USA

1

Abstract - With the rapid development of weather monitoring system, numerous observational data are available. For example, NOAA provides Global Surface Summary of Day

(GSOD) data that incorporates daily weather measurements from over 9000 weather stations around the world. In this paper, a comprehensive workflow and methodology is presented to elaborate how to transform GSOD data into a new and useful format so as to generate interpolated product of daily, monthly and annual mean surface temperature datasets by using advanced computation platforms. The quality of this gridded, high resolution (at ¼ degree) daily product is further examined by comparing to an existing global climate dataset. A preliminary comparison on global surface temperature shows a consistent agreement between these two datasets, with the major differences located in a few regions. The interpolated GSOD data products are supplementary to existing datasets by providing new gridded, high resolution observation-based daily temperature information over three decades (1982-2011), which are very useful for decadal climate change researches.

 

High Efficiency Video Decoding on Multicore Processor

PDF

16

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

High Efficiency Video Decoding on Multicore Processor

Hyeonggeon Lee1, Jong Kang Park2, and Jong Tae Kim1,2

Department of IT Convergence1

Sungkyunkwan University

Suwon, Korea

Department of Electrical and Computer Engineering2

Sungkyunkwan University

Suwon, Korea

Abstract—In this paper we present a High Efficiency

Video Coding(HEVC) decoder implemented using multicore processor. HEVC can support Ultra High

Definition (UHD) digital TV and resolution up to

8192x4320. It aims to achieve compression rate in the range of 50% bit-rate relative to existing standards. And decoding speed should be over the 30 fps(frames per second). Although multicore processors have sufficient performance and enough memory, HEVC software cannot make use it very well. So not only the proper architecture for HEVC is needed but also modified HEVC software should be considered. Gem5 simulator was used to simulate the performance of our decoder. Base on the simulation results we suggest appropriate multicore architecture platform which can satisfy the goal of HEVC decoding speed.

 

Design of an In-Memory Database Engine Using Intel Xeon Phi Coprocessors

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

21

Design of an In-Memory Database Engine

Using Intel Xeon Phi Coprocessors

Michael Scherger

Department of Computer Science

Texas Christian University

Fort Worth, TX, USA

Email: m.scherger@tcu.edu

Abstract – (PDPTA’14) This research presents the design and initial implementation of a database engine using an Intel Xeon Phi co-processor. The many integrated cores (MIC) of the Xeon Phi make this hardware accelerator a natural computing platform for an in-memory database engine or server. The database tables reside in the memory space of the MIC thus supporting fast in-memory database applications. This achieved by developing a coalescing parallel memory manager to allocate parallel variables in the same manner that fields are created in a table using a SQL

CREATE TABLE command. The SQL interface was created using a database driver toolkit that provides an interface to the Xeon Phi server and client application.

Once the basic framework was established, the algorithms for SQL select, insert, update, delete, and join were created to manipulate database information in the memory of the Xeon Phi.

 

Accelerating Medical Image Registration Using a SIMD Array

PDF

28

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

Accelerating Medical Image Registration Using a SIMD

Array

I. K. Jeong1, M. S. Kang1, C. H. Kim2 and J. M. Kim1,*

School of Electrical Engineering, University of Ulsan, Ulsan, South Korea

2

School of Electronics and Computer Engineering, Chonnam National University, Gwangju, South Korea

1

Abstract - Medical image registration plays an important role in medical imaging in the early detection of cancers. An essential component in most medical registration approaches is resampling algorithms. These algorithms, however, demand tremendous computational power associated with similarity computation. The increasing availability of parallel computers makes parallelizing these tasks an attractive option. This paper presents parallel approaches for the resampling algorithms using a representative parallel Single Instruction,

Multiple Data (SIMD) processor array to meet the computational requirements. This paper also presents not only a general theory of resampling algorithms including rotation, scaling, and translation, but also parallel implementations of these algorithms on the SIMD processor array. Experimental results show that parallel approaches achieve a speedup of

 

Processing Hard Sphere Collisions on a GPU Using OpenCL

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

35

Processing Hard Sphere Collisions on a GPU Using OpenCL

Zachary Langbert1 and Mark C. Lewis1

1 Computer Science, Trinity University, San Antonio, TX, USA

Abstract— Physically accurate hard sphere collisions are inherently sequential as the order in which collisions occur can have a significant impact on the resulting system. This makes processing hard sphere collisions on parallel hardware challenging. We present an approach to solving this problem that can be implemented using OpenCL that runs on current hardware. This approach makes significant use of atomic operations to prevent race conditions, even across thread groups. We find that an unoptimized implementation of the approach provides speed on modest GPUs that is on par with our earlier OpenMP parallel CPU approach and the OpenCL running on a CPU is faster than the OpenMP code. Full timing results using commodity GPU and using

OpenCL on multi-core chips are presented.

Keywords: Simulation, collisions, parallel, discrete-event, GPU

 

Scalable Self-Tuning Implementation of Smith-Waterman Algorithm for Multicore CPUs

PDF

42

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

Scalable Self-Tuning Implementation of

Smith-Waterman Algorithm for Multicore CPUs

Faisal Sikder and Dilip Sarkar

Keywords-CPU Oblivious, Scalable Algorithm, Multicore CPU,

OpenMP, Parallel algorithm, Shared memory architecture,

Smith-Waterman algorithm.

Abstract—Improved version of the Smith-Waterman algorithm

(SWA) is most widely used for local alignment of a pattern (or query) sequence with a Database (DB) sequence. This dynamicprogramming algorithm is computation intensive. To reduce time for computing alignment score matrix, parallel versions have been implemented on GPUs and multicore CPUs. These parallel versions have shown significant speedup when compared with their corresponding sequential versions.

Our initial evaluation of an OpenMP parallelization of SWA has shown linear speedup on multicore CPUs, but a closer look at performance data from both sequential and parallel versions have revealed two undesired effects: (i) As the length of the

 

Low-power Task Scheduling in Homogeneous Quad-core Processor

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

49

Low-power Task Scheduling in

Homogeneous Quad-core Processor

Kyuhan Kim1, Myungha Kim2, Hyeonggeon Lee1, Jong Kang Park2, and Jong Tae Kim1,2

Department of IT Convergence1

Sungkyunkwan University

Suwon, Korea

Department of Electrical and Computer Engineering2

Sungkyunkwan University

Suwon, Korea jtkim@skku.edu

Abstract— In this paper, we consider the low-energy task scheduling algorithm in quad-core environment. As adding the factor of energy consumption in the DAG, we can consider the factor of energy consumption in algorithm level, and we propose the low-energy task clustering and scheduling algorithm. As a result, we can improve energy efficiency for the task scheduling in the DAG with time constraint. We focus on reducing the core idle state energy consumption and propose the modified task clustering method for lowering energy consumption for idle stats core execution.

Keywords: DAG, Task clustering, Task scheduling

1

Introduction

As adopted multi-core processing techniques, the needs have been satisfied with the performance, but operating a large number of cores such as dual-core and quad-core brings about some issues related to low power design like power consumption and long battery life, and so on. In responding to this trend, we are concerned some critical issues for the task scheduling to promote energy efficiency in multi-core processor environment [1].

 

An SIMD Architecture for Shortest-Path Search and Its FPGA Implementation

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

53

An SIMD Architecture for Shortest-Path Search and Its FPGA

Implementation

Yasuhiro Takei, Masanori Hariyama and Michitaka Kameyama

Graduate School of Information Sciences, Tohoku University

Aoba 6-6-05, Aramaki, Aoba, Sendai, Miyagi, 980-8579, Japan

Email: {takei, hariyama, kameyama}@ecei.tohoku.ac.jp

Abstract— Shortest-path search over graphs plays an important role in various applications. However, shortest path algorithms such as the Dijkstra’s algorithm include complex processings. It is difficult for accelerators with fixed-datapath such as GPUs to accelerate these algorithms efficiently. This paper presents an

FPGA-based accelerator with a SIMD architecture for the shortest-paths algorithm. In the proposed architecture, operations in the Dijkstra’s algorithm are done with a high degree of parallelism, and the memory usage is reduced by using a memory management scheme. According to the evaluation, the proposed architecture is able to deal with graphs with more than

 

Parallel Fuzzy Filter for Impulse Noise Removal

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

57

Parallel Fuzzy Filter for Impulse Noise Removal

J. Arnal1 , L. A. Drummond2 , L. B. Súcar1 , and V. Vidal3

1 Departamento de Ciencia de la Computación e Inteligencia Artificial,

Universidad de Alicante, San Vicente del Raspeig, Alicante, Spain

2 Lawrence Berkeley National Laboratory, Berkeley, CA, USA

3 Departamento de Sistemas Informáticos y Computación,

Universidad Politécnica de Valencia, Grao de Gandia, Valencia, Spain

Abstract— A parallel algorithm for reducing impulse noise from color images is proposed. The algorithm is based on a fuzzy metric and is performed in two steps followed by noise

filtering using the vector median filter. An implementation of the algorithm on multi-core interface using the Open MultiProcessing (OpenMP) is presented. A performance analysis with large images is conducted. Performance is evaluated in terms of execution time and in terms of PSNR. Results show that the proposed filter obtains good performance in terms of

 

Parallel Insertion Merge

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

63

Parallel Insertion Merge

Fernando Belmiro do Couto1 and Fabio Silva do Couto2

1

Divisao de Ti, BBDTVM, Rio de Janeiro, RJ, Brasil

2

UFRJ, Rio de Janeiro, RJ, Brasil

PDPTA

Abstract – Parallel Insertion Merge is an algorithm that maps the subsets, ordering them with merge and insertion routines and using parallel processing. Most sort algorithms in use nowadays treat the mass of data without previously analyzing its distribution, no matter if dealing with partially ordered datasets. According to the methodology applied in this algorithm the whole data is in the worst case ordered each two elements. According to the same method we prove that at random distribution, are statistically distributed among subsets of 2 + 3 elements.

The major parallel routine order two subsets using X threads in the lowest subset, dividing this subset in X parts, and searching the position of the edge elements in highest subset. The next step starts without consider the edge elements used and if they point to the same target, not using the elements between them, because all those elements have their positions assigned.

 

Maximizing Hardware Performance via Non-blocking Collective Communication for All Pairs Shortest Paths Computation on Heterogeneous Multi-core Processors

PDF

66

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

Maximizing Hardware Performance via Non-blocking

Collective Communication for All Pairs Shortest Paths

Computation on Heterogeneous Multi-core Processors

Eduardo A. Colmenares1, Yu Zhuang1

1

Department of Computer Science, Texas Tech University, Lubbock, Texas, USA

Abstract - Dynamic data sharing among cores during computation in a multi-core architectural environment has been recognized as one of the factors that add to the cost of the total execution time. One way of reducing the impact of the latency generated by intense data sharing is through communication hiding.

The idea behind communication hiding is to create opportunities in the computation algorithms that can make use of system and/or hardware resources that allow the processors to engage in useful work while the sharing of information is taking place. One of the possible ways to achieve overlapping of communication and computation is by making use of non-blocking resources. In this paper, we present an algorithmically restructured solution for the All

 

How to find the best rules for CA-based PRNG

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

71

How to find the best rules for CA-based PRNG

Miroslaw Szaban

Institute of Computer Science

Siedlce University

Poland

E-mail: mszaban@uph.edu.pl

Abstract—In this paper is consider a problem of generation by cellular automata of high quality pseudorandom sequences useful in cryptography. For this purpose one dimensional nonuniform cellular automata (1D CA) is used. The quality of pseudorandom bit sequences generated by cellular automata depends on collective behavior of rules assigned to cellular automata cells.

In the paper is presented new method of constructing CA rules, based on balance of the CA and CA rule. The paper gives an answer how to find rules, which do not generate stable sequences, passes cryptographical tests (provide high quality pseudorandom sequences) and show maximal as possible set of such CA rules.

So, presented set of rules is the best for use in 1D CA-based

Pseudorandom Number Generator (PRNG). Sequences generated by these PRNB are suitable for symmetric key cryptography and can be used in different cryptographic algorithms.

 

Session - Communication Systems and Input Output Systems + Interconnection Networks and Topologies + Routing Methods + New Protocols + VANET + Peer-to-Peer Networks + Sensor Networks and Applications

PDF

 

A Fault-tolerant Routing Algorithm based on Safety Levels in a Hyper-Star Graph

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

77

A Fault-tolerant Routing Algorithm based on Safety Levels in a

Hyper-Star Graph

Yuko Sasaki

Yuki Hirai

Hironori Nakajo

Keiichi Kaneko

Graduate School of Engineering

Tokyo University of Agriculture and Technology

Koganei-shi, Tokyo 184-8588, JAPAN

{50013646114@st,yhirai@cc,nakajo@cc,k1kaneko@cc}.tuat.ac.jp

Abstract A hyper-star graph HS(n, k) provides a promising topology for interconnection networks of parallel processing systems because it combines the merits of a hypercube and a star graph. In this study, we propose a fault-tolerant routing algorithm that establishes a fault-free path between any pair of non-faulty nodes in an HS(n, k) with faulty nodes by using limited global information called safety levels. In addition, we carried out a computer experiment to verify the effectiveness of the algorithm.

If each non-faulty node collects information of all faulty nodes as global information, optimal fault-tolerant routing is possible. However, this approach is impractical since it requires space and time complexities, whose orders are equal to the number of nodes in the graph. On the other hand, if each non-faulty node collects the status of its neighbor nodes only as local information for fault-tolerant routing, high reachability cannot be attained. Therefore, some approaches collect a part of the global information to attain high reachability. The information is called limited global information.

 

A Novel Quorum Protocol for Improved Performance

PDF

84

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

A Novel Quorum Protocol for Improved Performance

A. Parul Pandey1 , B. M Tripathi2

2 Computer Science, Institution of Engineering and Technology , Lucknow, U.P., India

Abstract— In this paper, we present an efficient quorum protocol for reading data with minimum read quorum size. This protocol for managing replicated data is named as Wheel Quorum Protocol. We impose a logical wheel structure on the set of copies of an object. The protocol ensures minimum read quorum size of one, by reading one copy of an object while maintaining acceptable size of write operations. In this paper, we also analyze several quorum types in terms of quorum size and message overhead. Our protocol proves to incur minimum communication overhead. Wheel structure has a wider application area as it can be imposed in a network with any number of nodes. This protocol is especially beneficial for read intensive applications.

Keywords: Replica-control, distributed database, quorum consensus.

 

Incentive Scheme for P2P Live Streaming Systems Being Aware of the Upload Capability of the Participants

PDF

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'14 |

91

Incentive Scheme for P2P Live Streaming Systems

Being Aware of the Upload Capability of the Participants

Shogo KANDA1 and Satoshi FUJITA1

1 Department of Information Engineering, Hiroshima University

Higashi-Hiroshima, 739-8527, Japan

Abstract— In this paper, we propose an incentive scheme for P2P live streaming systems being aware of the upload bandwidth of the participants. The basic idea of the scheme is to combine a point-based incentive scheme with the notion of minimum guaranteed services. The performance of the proposed scheme is evaluated by simulation. The result of simulation indicates that: 1) the proposed scheme increases the utility of poor peers by 30% compared with Sepidar which is a typical auction-based incentive scheme and 2) compared with the Chu’s scheme, which is a typical taxationbased scheme, the proposed scheme gradually increases the utility of rich peers as the amount of contributions increase, whereas the utility under the Chu’s scheme does not change after reaching the limit determined by the taxation rate.

 

Load more


Details

Print Book
E-Books
Slices

Format name
PDF
Encrypted
No
Sku
B000000032085
Isbn
9781601323262
File size
27.8 MB
Printing
Allowed
Copying
Allowed
Read aloud
Allowed
Format name
PDF
Encrypted
No
Printing
Allowed
Copying
Allowed
Read aloud
Allowed
Sku
In metadata
Isbn
In metadata
File size
In metadata