While the focus of this paper and the work described above is on addressing the skew problem at the framework level, an application-specific remedy could be another effective way to mitigate skewness. CloudRAMSort [31] and SDS-Sort [32] are two examples of handling skewed data in the context of distributed sorting. The idea is to exploit knowledge about the data itself by partitioning the keyspace based on the result of random sampling before the data is passed to a lower-level framework such as MPI. Our solution, on the other hand, does not rely on any characteristics of the raw data, but only on the sizes of the serialized byte arrays. It is thus a more generic solution to the problem and intends to reduce the burden on application developers.
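The sample-and-split idea can be illustrated with a minimal C++ sketch, assuming 64-bit integer keys; the function names, fixed random seed, and sample size are illustrative and are not taken from CloudRAMSort or SDS-Sort.

```cpp
// Minimal sketch of sample-based keyspace partitioning, the idea behind
// CloudRAMSort/SDS-Sort. Integer keys and all names are illustrative.
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Draw a random sample of keys, sort it, and pick num_parts - 1 splitters
// that approximately equalize the number of keys per partition.
std::vector<uint64_t> pick_splitters(const std::vector<uint64_t>& keys,
                                     size_t num_parts, size_t sample_size) {
  std::mt19937_64 rng(42);
  std::uniform_int_distribution<size_t> idx(0, keys.size() - 1);
  std::vector<uint64_t> sample(sample_size);
  for (auto& s : sample) s = keys[idx(rng)];
  std::sort(sample.begin(), sample.end());

  std::vector<uint64_t> splitters;
  for (size_t p = 1; p < num_parts; ++p)
    splitters.push_back(sample[p * sample.size() / num_parts]);
  return splitters;
}

// Map a key to its destination rank by binary search over the splitters,
// so each rank receives a roughly equal share of the (skewed) key space.
size_t partition_of(uint64_t key, const std::vector<uint64_t>& splitters) {
  return std::upper_bound(splitters.begin(), splitters.end(), key) -
         splitters.begin();
}
```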
7. Conclusions
In this paper, we proposed and examined the three in-

References
[1] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Commun. ACM, vol.51, no.1, pp.107–113,
Jan. 2008.
[2] T. Hoefler, A. Lumsdaine, and J. Dongarra, “Towards Efficient
MapReduce Using MPI,” Recent Advances in Parallel Virtual
Machine and Message Passing Interface, vol.5759, pp.240–249,
Springer Berlin Heidelberg, 2009.
[3] H. Mohamed and S. Marchand-Maillet, “MRO-MPI: MapReduce
overlapping using MPI and an optimized data exchange policy,” Parallel. Comput., vol.39, no.12, pp.851–866, 2013.
[4] M. Matsuda, N. Maruyama, and S. Takizawa, “K MapReduce: A
scalable tool for data-processing and search/ensemble applications
on large-scale supercomputers,” 2013 IEEE International Conference on Cluster Computing (CLUSTER), pp.1–8, 2013.
[5] Y. Guo, W. Bland, P. Balaji, and X. Zhou, “Fault Tolerant MapReduce-MPI for HPC Clusters,” Proc. International Conference for
High Performance Computing, Networking, Storage and Analysis,
SC ’15, New York, NY, USA, pp.34:1–34:12, ACM, 2015.
[6] T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M.
Taufer, “Mimir: Memory-Efficient and Scalable MapReduce for
Large Supercomputing Systems," 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.1098–1108, May 2017.
[7] Apache Hadoop, http://hadoop.apache.org/.
[8] Apache Spark, http://spark.apache.org/.
[9] In-Memory MapReduce - Apache Ignite, https://ignite.apache.org/features/mapreduce.html.
[10] S. Matsuoka, H. Sato, O. Tatebe, M. Koibuchi, I. Fujiwara, S. Suzuki, M. Kakuta, T. Ishida, Y. Akiyama, T. Suzumura, K. Ueno, H. Kanezashi, and T. Miyoshi, "Extreme Big Data (EBD): Next Generation Big Data Infrastructure Technologies Towards Yottabyte/Year," Supercomputing Frontiers and Innovations, vol.1, no.2, 2014.
[11] Collective Communication, https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node63.html.
[12] The standardization forum for the Message Passing Interface (MPI), http://mpi-forum.org.
[13] C. Xu, M.G. Venkata, R.L. Graham, Y. Wang, Z. Liu, and W. Yu, "SLOAVx: Scalable LOgarithmic AlltoallV Algorithm for Hierarchical Multicore Systems," 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp.369–376, May 2013.
[14] About Oakforest-PACS, http://jcahpc.jp/eng/ofp_intro.html.
[15] What is K?, http://www.aics.riken.jp/en/k-computer/about/.
[16] H. Daikoku, Source code of the prototype MapReduce implementation, https://bitbucket.org/hdaikoku/pmr/src/devel/.
[17] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," The International Journal of High Performance Computing Applications, vol.19, no.1, pp.49–66, 2005.
[18] Implementation of MPI_Alltoallv in mpich-3.2.1, https://github.com/pmodels/mpich/blob/v3.2.1/src/mpi/coll/alltoallv.c.
[19] Implementation of MPI_Alltoall in mpich-3.2.1, https://github.com/pmodels/mpich/blob/v3.2.1/src/mpi/coll/alltoall.c.
[20] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing," Proc. 9th USENIX Conference on Networked Systems Design and Implementation, NSDI '12, Berkeley, CA, USA, pp.15–28, USENIX Association, 2012.
[21] Libfabric, https://ofiwg.github.io/libfabric/.
[22] K. Elmeleegy, C. Olston, and B. Reed, "SpongeFiles: Mitigating Data Skew in Mapreduce Using Distributed Memory," Proc. 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, New York, NY, USA, pp.551–562, ACM, 2014.
[23] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K.G. Shin, "Efficient memory disaggregation with Infiniswap," 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp.649–667, USENIX Association, March 2017.
[24] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: Mitigating skew in Mapreduce applications," Proc. 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, New York, NY, USA, pp.25–36, ACM, 2012.
[25] S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, and S. Wu, "Handling partitioning skew in MapReduce using LEEN," Peer-to-Peer Networking and Applications, vol.6, no.4, pp.409–424, Dec. 2013.
[26] Y. Gao, Y. Zhou, B. Zhou, L. Shi, and J. Zhang, "Handling Data Skew in MapReduce Cluster by Using Partition Tuning," Journal of Healthcare Engineering, vol.2017, pp.1–12, 2017.
[27] B. Nicolae, C.H.A. Costa, C. Misale, K. Katrinis, and Y. Park, "Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics," IEEE Trans. Parallel Distrib. Syst., vol.28, no.6, pp.1663–1674, 2017.
[28] M. Wasi-ur-Rahman, N.S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D.K. Panda, "High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand," 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp.1908–1917, 2013.
[29] X. Lu, D. Shankar, S. Gugnani, and D.K. Panda, “High-performance
design of apache spark with RDMA and its benefits on various workloads,” 2016 IEEE International Conference on Big Data (Big Data),
pp.253–262, Dec. 2016.
[30] H. Daikoku, H. Kawashima, and O. Tatebe, “On Exploring Efficient
Shuffle Design for In-memory MapReduce,” Proc. 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and
Beyond, BeyondMR ’16, pp.6:1–6:10, 2016.
[31] C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani,
“CloudRAMSort: Fast and Efficient Large-scale Distributed RAM
Sort on Shared-nothing Cluster,” Proc. 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12,
pp.841–850, 2012.
[32] B. Dong, S. Byna, and K. Wu, “SDS-Sort: Scalable Dynamic
Skew-aware Parallel Sorting,” Proc. 25th ACM International Symposium on High-Performance Parallel and Distributed Computing,
HPDC ’16, pp.57–68, 2016.
[33] IME Flash-Native Data Cache, DDN, https://www.ddn.com/products/ime-flash-native-data-cache/.
Appendix: Skew Dealing Method with Shared File System
As described in Sect. 1, this study assumes a target environment of typical HPC clusters such as Oakforest-PACS
(OFP) and K computer, which have no local disks on computing nodes. Since accessing files on shared storages, such
as Lustre or GlusterFS, is relatively expensive, such an environment makes the spill-to-storage solution employed by
Hadoop and Spark less appealing. This section presents the
results of the performance comparison between the DSA w/
SMS method and the so-called On-Disk method.
Figure A·1 gives an overview of block transfer in the On-Disk method. In the meta-shuffle phase, the On-Disk method checks whether the blocks can fit in node memory in the same way as the DSA w/ SMS method. Unlike the DSA w/ SMS method, however, the On-Disk method spills overflowing blocks to the shared file system, making them globally accessible to remote processes. In the block-shuffle phase, blocks are transferred both over the network and through the shared file system. After the shuffling, each reduce task fetches its own blocks from local memory or from the shared file system via the Shuffle Handler module.
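A minimal sketch of this placement decision is given below, assuming a fixed per-node memory budget and one spill file per block; the class, member names, and file layout are illustrative and do not reproduce the prototype's actual implementation.

```cpp
// Simplified sketch of the On-Disk block placement decision: blocks that
// fit within the node's memory budget stay in memory, overflowing blocks
// are written to the shared file system so that remote reduce tasks can
// read them directly. Names and paths are illustrative.
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct Block {
  int reduce_task;            // destination reduce task
  std::vector<char> payload;  // serialized key-value pairs
};

class OnDiskShuffleHandler {
 public:
  OnDiskShuffleHandler(size_t mem_budget, std::string spill_dir)
      : budget_(mem_budget), spill_dir_(std::move(spill_dir)) {}

  // Meta-shuffle: decide, per block, whether it fits in memory.
  // Returns the path of the spilled file, or "" if the block stays in memory.
  std::string place(Block&& b) {
    if (used_ + b.payload.size() <= budget_) {
      used_ += b.payload.size();
      in_memory_.push_back(std::move(b));
      return "";
    }
    // Spill to the shared file system (e.g. Lustre or IME); the file is
    // globally visible, so the remote reduce task can fetch it later.
    std::string path = spill_dir_ + "/block_r" + std::to_string(b.reduce_task);
    std::ofstream(path, std::ios::binary)
        .write(b.payload.data(), b.payload.size());
    return path;
  }

 private:
  size_t budget_, used_ = 0;
  std::vector<Block> in_memory_;
  std::string spill_dir_;
};
```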
We implemented the On-Disk method in our prototype
MapReduce engine described in Sect. 4. We again evaluated its performance on 128 computing nodes of
the OFP. There are two file systems available on the OFP:
the HDD-based Lustre file system, and the SSD-based burst
buffer, also known as IME [33]. The IME can be used for
caching files on the Lustre, as well as for storing temporary files during job execution.

Fig. A·1  Block transfer via shared file system

Fig. A·2  Comparison of skew tolerance: DSA w/ SMS & On-Disk

For this experiment, the
temporary directory for block spilling was set to either the
Lustre or the IME, and we measured the shuffle execution
time for each scenario. We also investigated the effect of network performance by executing the workload with two different network libraries, BSD sockets and OFI libfabric. The former uses the IPoIB network of Intel Omni-Path, while the latter is expected to perform better because it leverages Omni-Path's native psm2 interface. All other parameters replicated the conditions of the experiments
detailed in Sect. 5.
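As a rough illustration, switching the spill target between the two file systems could be done with a small helper like the one below; the environment variable name and mount points are hypothetical and are not the prototype's actual configuration interface.

```cpp
// Hypothetical selection of the On-Disk spill directory for this experiment.
// SPILL_TARGET and the mount points are assumptions, not real settings.
#include <cstdlib>
#include <string>

std::string select_spill_dir() {
  // e.g. SPILL_TARGET=ime selects the SSD-based burst buffer (IME),
  // anything else falls back to the HDD-based Lustre file system.
  const char* target = std::getenv("SPILL_TARGET");
  if (target && std::string(target) == "ime")
    return "/cache/ime/tmp";
  return "/work/lustre/tmp";
}
```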
Figure A·2 shows the execution times of the two shuffling methods, DSA w/ SMS and On-Disk, as the skewness
parameter was increased from 2.5 to 3.0, in increments of
0.1. The Shuffle segments of each bar represent the overall
shuffle execution time, while the Block Read segments represent the amount of time that reduce tasks spent fetching
blocks from the Shuffle Handler module and deallocating
their memory regions. For the On-Disk method, the Shuffle
segment includes the time spent spilling blocks to the
shared file system, and the Block Read segment includes the
time spent reading the blocks from the shared file system.
The results, displayed in Fig. A·2 (a), suggest that, when using OFI libfabric, the advantage of in-memory shuffling in DSA w/ SMS becomes more pronounced as skewness increases. When the skewness was 3.0, the DSA w/
SMS improved the shuffling performance over the On-Disk
(Lustre) and the On-Disk (IME) by factors of 2.49 and 1.39,
respectively. Furthermore, the impact of expensive file I/O becomes more apparent in the Block Read phase. In the Shuffle phase, multiple processes were spilling blocks to the file system in parallel, whereas in the Block Read phase only the single straggler was reading the blocks, resulting in a relatively longer execution time for that phase. Although the
IME improved the Block Read execution time by a factor
of up to 3.23, the DSA w/ SMS method still achieved the
fastest overall execution times in higher skew situations.
When using BSD sockets, as shown in Fig. A·2 (b), the DSA w/ SMS method was no longer the best overall solution. For example, when the skewness was 2.9, the On-Disk (IME) was 7.20% faster than the DSA w/ SMS. The relatively poor performance of the IPoIB network is the most
likely cause of the observed performance degradation.
Based on our analysis of the above results, we conclude that DSA w/ SMS is not a promising solution to the skew problem unless it is applied to systems equipped with high-performance interconnects.
Harunobu Daikoku received an M.Eng. from the University of Tsukuba, Japan, in 2018. His research area is high-performance computing and distributed system software.
Hideyuki Kawashima received a Ph.D. from the Graduate School of Science for Open and Environmental Systems, Keio University, Japan. He was a research associate at the Department of Science and Engineering, Keio University, from 2005 to 2007. From 2007 to 2011, he was an assistant professor at both the Graduate School of Systems and Information Engineering and the Center for Computational Sciences, University of Tsukuba, Japan. From 2016 to 2018, he was an associate professor at the Center for Computational Sciences, University of Tsukuba. Since 2018, he has been an associate professor in Environment and Information Studies, Keio University.
Osamu Tatebe received a Ph.D. in computer science from the University of Tokyo in 1997. He worked at the Electrotechnical Laboratory (ETL) and the National Institute of Advanced Industrial Science and Technology (AIST) until 2006. He is now a professor at the Center for Computational Sciences, University of Tsukuba. His research areas are high-performance computing, data-intensive computing, and parallel and distributed system software.