Abstract
The world has experienced phenomenal growth in data production and storage in recent years, much of which has taken the form of media files. At the same time, computing power has become abundant with multi-core machines, grids, and clouds. Yet it remains a challenge to harness the available power and move toward gracefully searching and retrieving from web-scale media collections. Several researchers have experimented with using automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly using small collections on small computing clusters. In this article, we describe a prototype of a (near) web-scale, throughput-oriented multimedia retrieval service using the Spark framework running on the AWS cloud service. We present retrieval results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. We also present a publicly available demonstration retrieval system, running on our own servers, where the implementation of the Spark pipelines can be observed in practice using standard image benchmarks, and downloaded for research purposes. Finally, we describe a method to evaluate retrieval quality of the ever-growing high-dimensional index of the prototype, without actually indexing a web-scale media collection.
Original language | English |
---|---|
Article number | 65 |
Journal | ACM Transactions on Multimedia Computing, Communications and Applications |
Volume | 14 |
Issue number | 3s |
DOIs | https://doi.org/10.1145/3209662 |
Publication status | Published - Jun 2018 |
Bibliographical note
Funding Information:
Part of the work of G. Þ. Guðmundsson and M. J. Franklin was performed while they were at the AMPLab, University of California, Berkeley. The work of Gylfi Þór Guðmundsson was supported in part by the Inria@SiliconValley program. The research was also supported in part by DHS Award HSHQDC-16-3-00083, NSF CISE Expeditions Award CCF-1139158, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Apple Inc., Arimo, Blue Goji, Bosch, Cisco, Cray, Cloudera, Ericsson, Facebook, Fujitsu, HP, Huawei, Intel, Microsoft, Mitre, Pivotal, Samsung, Schlumberger, Splunk, State Farm, and VMware.
Authors’ addresses: G. Þ. Guðmundsson, School of Computer Science, Reykjavik University, Menntavegi 1, 101 Reykjavik, Iceland; email: [email protected]; B. Þ. Jónsson, Computer Science Department, IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen S, Denmark; email: [email protected]; L. Amsaleg, IRISA-CNRS, Campus de Beaulieu, 35042 Rennes cedex, France; email: [email protected]; M. J. Franklin, Department of Computer Science, Ryerson Laboratory 152, University of Chicago, 1100 E 58th St., Chicago, IL 60637 USA; email: [email protected].
Authors’ current addresses: B. Þ. Jónsson, Reykjavik University, Iceland; G. Þ. Guðmundsson and M. J. Franklin, AMPLab, University of California, 465 Soda Hall, MC-1776, Berkeley, CA 94720-1776, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 ACM 1551-6857/2018/06-ART65 $15.00 https://doi.org/10.1145/3209662
Publisher Copyright:
© 2018 ACM.
Other keywords
- Cloud computing
- Content-based image retrieval
- Distributed computing
- Scalability
- Spark