While the focus of this paper and the work described above is on addressing the skew problem at the framework level, an application-specific remedy could be another effective way to mitigate skewness. CloudRAMSort [31] and SDS-Sort [32] are two examples of handling skewed data in the context of distributed sorting. The idea is to exploit knowledge about the data itself by partitioning the keyspace based on the result of random sampling before the data is passed to a lower-level framework such as MPI. Our solution, on the other hand, does not rely on any characteristics of the raw data, but only on the sizes of the serialized byte arrays. It is thus a more generic solution to the problem and intends to reduce the burden on application developers.
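The sample-and-split idea can be illustrated with a minimal C++ sketch, assuming 64-bit integer keys; the function names, fixed random seed, and sample size are illustrative and are not taken from CloudRAMSort or SDS-Sort.

```cpp
// Minimal sketch of sample-based keyspace partitioning, the idea behind
// CloudRAMSort/SDS-Sort. Integer keys and all names are illustrative.
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Draw a random sample of keys, sort it, and pick num_parts - 1 splitters
// that approximately equalize the number of keys per partition.
std::vector<uint64_t> pick_splitters(const std::vector<uint64_t>& keys,
                                     size_t num_parts, size_t sample_size) {
  std::mt19937_64 rng(42);
  std::uniform_int_distribution<size_t> idx(0, keys.size() - 1);
  std::vector<uint64_t> sample(sample_size);
  for (auto& s : sample) s = keys[idx(rng)];
  std::sort(sample.begin(), sample.end());

  std::vector<uint64_t> splitters;
  for (size_t p = 1; p < num_parts; ++p)
    splitters.push_back(sample[p * sample.size() / num_parts]);
  return splitters;
}

// Map a key to its destination rank by binary search over the splitters,
// so each rank receives a roughly equal share of the (skewed) key space.
size_t partition_of(uint64_t key, const std::vector<uint64_t>& splitters) {
  return std::upper_bound(splitters.begin(), splitters.end(), key) -
         splitters.begin();
}
```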
7. Conclusions
In this paper, we proposed and examined the three in-

References
[1] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Commun. ACM, vol.51, no.1, pp.107–113,
Jan. 2008.
[2] T. Hoefler, A. Lumsdaine, and J. Dongarra, “Towards Efficient
MapReduce Using MPI,” Recent Advances in Parallel Virtual
Machine and Message Passing Interface, vol.5759, pp.240–249,
Springer Berlin Heidelberg, 2009.
[3] H. Mohamed and S. Marchand-Maillet, “MRO-MPI: MapReduce
overlapping using MPI and an optimized data exchange policy,” Parallel. Comput., vol.39, no.12, pp.851–866, 2013.
[4] M. Matsuda, N. Maruyama, and S. Takizawa, “K MapReduce: A
scalable tool for data-processing and search/ensemble applications
on large-scale supercomputers,” 2013 IEEE International Conference on Cluster Computing (CLUSTER), pp.1–8, 2013.
[5] Y. Guo, W. Bland, P. Balaji, and X. Zhou, “Fault Tolerant MapReduce-MPI for HPC Clusters,” Proc. International Conference for
High Performance Computing, Networking, Storage and Analysis,
SC ’15, New York, NY, USA, pp.34:1–34:12, ACM, 2015.
[6] T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M.
Taufer, “Mimir: Memory-Efficient and Scalable MapReduce for
Large Supercomputing Systems," 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.1098–1108, May 2017.
[7] Apache Hadoop, http://hadoop.apache.org/.
[8] Apache Spark, http://spark.apache.org/.
[9] In-Memory MapReduce - Apache Ignite, https://ignite.apache.org/features/mapreduce.html.
[10] S. Matsuoka, H. Sato, O. Tatebe, M. Koibuchi, I. Fujiwara, S. Suzuki, M. Kakuta, T. Ishida, Y. Akiyama, T. Suzumura, K. Ueno, H. Kanezashi, and T. Miyoshi, "Extreme Big Data (EBD): Next Generation Big Data Infrastructure Technologies Towards Yottabyte/Year," Supercomputing Frontiers and Innovations, vol.1, no.2, 2014.
[11] Collective Communication, https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node63.html.
[12] The standardization forum for the Message Passing Interface (MPI), http://mpi-forum.org.
[13] C. Xu, M.G. Venkata, R.L. Graham, Y. Wang, Z. Liu, and W. Yu, "SLOAVx: Scalable LOgarithmic AlltoallV Algorithm for Hierarchical Multicore Systems," 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp.369–376, May 2013.
[14] About Oakforest-PACS, http://jcahpc.jp/eng/ofp_intro.html.
[15] What is K?, http://www.aics.riken.jp/en/k-computer/about/.
[16] H. Daikoku, Source code of the prototype MapReduce implementation, https://bitbucket.org/hdaikoku/pmr/src/devel/.
[17] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," The International Journal of High Performance Computing Applications, vol.19, no.1, pp.49–66, 2005.
[18] Implementation of MPI_Alltoallv in mpich-3.2.1, https://github.com/pmodels/mpich/blob/v3.2.1/src/mpi/coll/alltoallv.c.
[19] Implementation of MPI_Alltoall in mpich-3.2.1, https://github.com/pmodels/mpich/blob/v3.2.1/src/mpi/coll/alltoall.c.
[20] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, and I. Stoica, "Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing," Proc. 9th USENIX Conference on Networked Systems Design and Implementation, NSDI '12, Berkeley, CA, USA, pp.15–28, USENIX Association, 2012.
[21] Libfabric, https://ofiwg.github.io/libfabric/.
[22] K. Elmeleegy, C. Olston, and B. Reed, "SpongeFiles: Mitigating Data Skew in Mapreduce Using Distributed Memory," Proc. 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, New York, NY, USA, pp.551–562, ACM, 2014.
[23] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K.G. Shin, "Efficient memory disaggregation with Infiniswap," 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp.649–667, USENIX Association, March 2017.
[24] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "SkewTune: Mitigating skew in Mapreduce applications," Proc. 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, New York, NY, USA, pp.25–36, ACM, 2012.
[25] S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, and S. Wu, "Handling partitioning skew in MapReduce using LEEN," Peer-to-Peer Networking and Applications, vol.6, no.4, pp.409–424, Dec. 2013.
[26] Y. Gao, Y. Zhou, B. Zhou, L. Shi, and J. Zhang, "Handling Data Skew in MapReduce Cluster by Using Partition Tuning," Journal of Healthcare Engineering, vol.2017, pp.1–12, 2017.
[27] B. Nicolae, C.H.A. Costa, C. Misale, K. Katrinis, and Y. Park, "Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics," IEEE Trans. Parallel Distrib. Syst., vol.28, no.6, pp.1663–1674, 2017.
[28] M. Wasi-ur-Rahman, N.S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D.K. Panda, "High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand," 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp.1908–1917, 2013.
[29] X. Lu, D. Shankar, S. Gugnani, and D.K. Panda, “High-performance
design of apache spark with RDMA and its benefits on various workloads,” 2016 IEEE International Conference on Big Data (Big Data),
pp.253–262, Dec. 2016.
[30] H. Daikoku, H. Kawashima, and O. Tatebe, “On Exploring Efficient
Shuffle Design for In-memory MapReduce,” Proc. 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and
Beyond, BeyondMR ’16, pp.6:1–6:10, 2016.
[31] C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani,
“CloudRAMSort: Fast and Efficient Large-scale Distributed RAM
Sort on Shared-nothing Cluster,” Proc. 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12,
pp.841–850, 2012.
[32] B. Dong, S. Byna, and K. Wu, “SDS-Sort: Scalable Dynamic
Skew-aware Parallel Sorting,” Proc. 25th ACM International Symposium on High-Performance Parallel and Distributed Computing,
HPDC ’16, pp.57–68, 2016.
[33] IME Flash-Native Data Cache, DDN, https://www.ddn.com/products/ime-flash-native-data-cache/.
Appendix: Skew Dealing Method with Shared File System
As described in Sect. 1, this study assumes a target environment of typical HPC clusters such as Oakforest-PACS
(OFP) and K computer, which have no local disks on computing nodes. Since accessing files on shared storages, such
as Lustre or GlusterFS, is relatively expensive, such an environment makes the spill-to-storage solution employed by
Hadoop and Spark less appealing. This section presents the
results of the performance comparison between the DSA w/
SMS method and the so-called On-Disk method.
Figure A·1 gives an overview of block transfer in the On-Disk method. In the meta-shuffle phase, the On-Disk method checks whether the blocks can fit in node memory in the same way as the DSA w/ SMS method. Unlike the DSA w/ SMS method, however, the On-Disk method spills overflowing blocks to the shared file system, making them globally accessible to remote processes. In the block-shuffle phase, blocks are transferred both over the network and through the shared file system. After the shuffling, each reduce task fetches its own blocks from local memory or from the shared file system via the Shuffle Handler module.
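A minimal sketch of this placement decision is given below, assuming a fixed per-node memory budget and one spill file per block; the class, member names, and file layout are illustrative and do not reproduce the prototype's actual implementation.

```cpp
// Simplified sketch of the On-Disk block placement decision: blocks that
// fit within the node's memory budget stay in memory, overflowing blocks
// are written to the shared file system so that remote reduce tasks can
// read them directly. Names and paths are illustrative.
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct Block {
  int reduce_task;            // destination reduce task
  std::vector<char> payload;  // serialized key-value pairs
};

class OnDiskShuffleHandler {
 public:
  OnDiskShuffleHandler(size_t mem_budget, std::string spill_dir)
      : budget_(mem_budget), spill_dir_(std::move(spill_dir)) {}

  // Meta-shuffle: decide, per block, whether it fits in memory.
  // Returns the path of the spilled file, or "" if the block stays in memory.
  std::string place(Block&& b) {
    if (used_ + b.payload.size() <= budget_) {
      used_ += b.payload.size();
      in_memory_.push_back(std::move(b));
      return "";
    }
    // Spill to the shared file system (e.g. Lustre or IME); the file is
    // globally visible, so the remote reduce task can fetch it later.
    std::string path = spill_dir_ + "/block_r" + std::to_string(b.reduce_task);
    std::ofstream(path, std::ios::binary)
        .write(b.payload.data(), b.payload.size());
    return path;
  }

 private:
  size_t budget_, used_ = 0;
  std::vector<Block> in_memory_;
  std::string spill_dir_;
};
```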
We implemented the On-Disk method in our prototype
MapReduce engine described in Sect. 4. We again evaluated its performance on 128 computing nodes of
the OFP. There are two file systems available on the OFP:
the HDD-based Lustre file system, and the SSD-based burst
buffer, also known as IME [33]. The IME can be used for
caching files on the Lustre, as well as for storing temporary files during job execution.

Fig. A·1  Block transfer via shared file system

Fig. A·2  Comparison of skew tolerance: DSA w/ SMS & On-Disk

For this experiment, the
temporary directory for block spilling was set to either the
Lustre or the IME, and we measured the shuffle execution
time for each scenario. We also investigated the effect of network performance by executing the workload with two different network libraries, BSD sockets and OFI libfabric. The former uses the IPoIB network of Intel Omni-Path, while the latter is expected to perform better because it leverages Omni-Path's native psm2 interface. All other parameters replicated the conditions of the experiments
detailed in Sect. 5.
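As a rough illustration, switching the spill target between the two file systems could be done with a small helper like the one below; the environment variable name and mount points are hypothetical and are not the prototype's actual configuration interface.

```cpp
// Hypothetical selection of the On-Disk spill directory for this experiment.
// SPILL_TARGET and the mount points are assumptions, not real settings.
#include <cstdlib>
#include <string>

std::string select_spill_dir() {
  // e.g. SPILL_TARGET=ime selects the SSD-based burst buffer (IME),
  // anything else falls back to the HDD-based Lustre file system.
  const char* target = std::getenv("SPILL_TARGET");
  if (target && std::string(target) == "ime")
    return "/cache/ime/tmp";
  return "/work/lustre/tmp";
}
```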
Figure A·2 shows the execution times of the two shuffling methods, DSA w/ SMS and On-Disk, as the skewness
parameter was increased from 2.5 to 3.0, in increments of
0.1. The Shuffle segments of each bar represent the overall
shuffle execution time, while the Block Read segments represent the amount of time that reduce tasks spent fetching
blocks from the Shuffle Handler module and deallocating
their memory regions. For the On-Disk method, the Shuffle
segment includes the time spent spilling blocks to the
shared file system, and the Block Read segment includes the
time spent reading the blocks from the shared file system.
The results, displayed in Fig. A·2 (a), suggest that, when using OFI libfabric, the advantage of in-memory shuffling in DSA w/ SMS becomes more pronounced as skewness increases. When the skewness was 3.0, the DSA w/
SMS improved the shuffling performance over the On-Disk
(Lustre) and the On-Disk (IME) by factors of 2.49 and 1.39,
respectively. Furthermore, the impact of expensive file I/O becomes more apparent in the Block Read phase. In the Shuffle phase, multiple processes were spilling blocks to the file system in parallel, whereas in the Block Read phase only the single straggler was reading the blocks, resulting in a relatively longer execution time for that phase. Although the
IME improved the Block Read execution time by a factor
of up to 3.23, the DSA w/ SMS method still achieved the
fastest overall execution times in higher skew situations.
When using BSD sockets, as shown in Fig. A·2 (b), the DSA w/ SMS method was no longer the best overall solution. For example, when the skewness was 2.9, the On-Disk (IME) was 7.20% faster than the DSA w/ SMS. The relatively poor performance of the IPoIB network is the most
likely cause of the observed performance degradation.
Based on our analysis of the above results, we conclude that DSA w/ SMS is not a promising solution to the skew problem unless it is applied to systems equipped with high-performance interconnects.
Harunobu Daikoku received an M.Eng. from the University of Tsukuba, Japan, in 2018. His research area is high-performance computing and distributed system software.
Hideyuki Kawashima received a Ph.D. from the Graduate School of Science for Open and Environmental Systems, Keio University, Japan. He was a research associate at the Department of Science and Engineering, Keio University, from 2005 to 2007. From 2007 to 2011, he was an assistant professor at both the Graduate School of Systems and Information Engineering and the Center for Computational Sciences, University of Tsukuba, Japan. From 2016 to 2018, he was an associate professor at the Center for Computational Sciences, University of Tsukuba. Since 2018, he has been an associate professor in Environment and Information Studies, Keio University.
Osamu Tatebe received a Ph.D. in computer science from the University of Tokyo in 1997. He worked at the Electrotechnical Laboratory (ETL) and the National Institute of Advanced Industrial Science and Technology (AIST) until 2006. He is now a professor at the Center for Computational Sciences, University of Tsukuba. His research areas are high-performance computing, data-intensive computing, and parallel and distributed system software.