Günter Ladwig

Efficient Optimization and Processing of Queries over Text-rich Graph-structured Data

This publication is published on the Internet under the following Creative Commons license: http://creativecommons.org/licenses/by-nc-nd/3.0/de/

KIT Scientific Publishing 2013
Print on Demand
ISBN 978-3-7315-0015-5

Dissertation, Karlsruher Institut für Technologie (KIT), Fakultät für Wirtschaftswissenschaften
Date of the oral examination: 19.02.2013
Reviewers: Prof. Dr. Rudi Studer, Prof. Dr. Heiner Stuckenschmidt

Imprint
Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe
www.ksp.kit.edu

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association

Dissertation approved by the Fakultät für Wirtschaftswissenschaften of the Karlsruher Institut für Technologie (KIT) for the academic degree of Doktor der Ingenieurwissenschaften (Dr.-Ing.), submitted by Dipl.-Inform. Günter Ladwig
Reviewer: Prof. Dr. Rudi Studer
Second reviewer: Prof. Dr. Heiner Stuckenschmidt

This thesis is dedicated to the memory of my mother, Barbara Ladwig (1945 - 1991)

Abstract

Many databases today are text-rich in that they capture not only structured, but also unstructured data. Hybrid data can take many forms, from databases that store text documents and structured data extracted from these documents to large parts of the Web that no longer consist of textual documents only, but often include large amounts of structured data.
The combination of structured and unstructured data, also known as the integration of databases (DB) and information retrieval (IR), has been an important topic for some time and has also attracted commercial interest. In research, this problem has gained much attention, particularly the topic of querying text-rich structured data, which we call hybrid data.

A multitude of query languages have been proposed to access unstructured data, structured data, or a combination of both, i.e. hybrid data. In the same way that we distinguish data, we can also largely categorize these query languages into three classes: unstructured, structured, and hybrid queries. The efficient evaluation of all three types of queries is an important concern and is becoming even more so with the growing amount of data that has to be processed and queried. The central challenge associated with query processing, regardless of query type, is that the search space for finding valid query answers is very large. On a high level, the challenge is then to minimize the search space in order to reduce the effort for producing query results, thereby increasing overall query performance. In terms of query processing, this can be achieved by either decreasing the amount of data to be processed or processing the data more efficiently. This thesis aims to tackle the challenge in both ways and examines processing techniques for all three types of queries.

Concerning unstructured keyword queries, we propose a solution that employs much more compact index structures for neighborhood lookups, thereby reducing the search space for query answers. Using these indexes, keyword search result exploration is reduced to the traditional database problem of top-k join processing, enabling results to be computed efficiently. In particular, this computation can be performed on data streams successively loaded from disk (i.e. does not require the entire input to be loaded at once into memory).
To support this, we propose a top-k procedure based on the rank join operator, which not only computes the k best results, but also selects query plans in a top-k fashion during the process. In experiments using large real-world datasets, the solution reduced storage requirements and also outperformed the state of the art in terms of performance and scalability.

Concerning structured queries over RDF data graphs, the topic of Linked Data query processing has recently gained attention. Linked Data query processing incurs new challenges associated with the large number of data sources, the limited access patterns that can be used to access the sources, and the lack of up-to-date knowledge about the sources. We propose a novel query processing strategy that combines knowledge available about previously indexed data sources with knowledge gained at run-time through online discovery of new sources to perform run-time adaptation of query plans. Data sources are ranked according to their importance in order to report results as early as possible. This ranking is adapted at run-time to incorporate new knowledge, thereby increasing query performance. We propose the symmetric index hash join (SIHJ), a novel operator that deals with the unpredictable nature of accessing data distributed over a large number of sources by employing stream-based processing techniques while still supporting the use of data stored in local indexes when available. Compared to previously proposed operators, SIHJ guarantees completeness with regard to the retrieved data sources and improves performance significantly.

We observe that the problems of source selection and data processing have been treated as separate in previous work. To address this, we propose a multi-objective optimization framework for the joint optimization of several query optimization objectives, cost and output cardinality in particular.
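When plans are judged on two objectives at once, such as cost and output cardinality, a common way to compare them is Pareto dominance: a plan is kept only if no other plan is at least as good on both objectives and strictly better on one. The following is a minimal illustrative sketch of that filter, not the optimizer developed in this thesis. The `Plan` type, its field names, and the example values are hypothetical.

```python
from typing import List, NamedTuple


class Plan(NamedTuple):
    """Hypothetical query plan summary used only for this illustration."""
    name: str
    cost: float        # estimated cost, lower is better
    cardinality: int   # estimated number of results, higher is better


def dominates(a: Plan, b: Plan) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.cost <= b.cost and a.cardinality >= b.cardinality
    strictly_better = a.cost < b.cost or a.cardinality > b.cardinality
    return no_worse and strictly_better


def pareto_front(plans: List[Plan]) -> List[Plan]:
    """Keep every plan that is not dominated by any other plan (quadratic scan)."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# Made-up example values, chosen only to show the filtering behaviour.
plans = [
    Plan("p1", cost=10, cardinality=100),
    Plan("p2", cost=20, cardinality=300),
    Plan("p3", cost=25, cardinality=250),  # dominated by p2
    Plan("p4", cost=10, cardinality=80),   # dominated by p1
]
front = pareto_front(plans)
print([p.name for p in front])  # ['p1', 'p2']
```

The surviving plans p1 and p2 represent different trade-offs: p1 is cheaper, p2 yields more results, and neither is better on both counts.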
We propose a dynamic programming (DP) solution for the multi-objective optimization of this integrated process of source selection and query processing. It produces a set of Pareto-optimal query plans, which represent different trade-offs between optimization objectives. The challenge of using DP here is that after retrieval, sources can be re-used in different parts of the query, i.e. the source scan operators can be shared. Depending on the reusability of these operators, the cost of subplans may vary such that the cost function is no longer monotonic with regard to the combination of subplans. We provide a tight-bound solution, which takes this effect into account. In experiments on real-world Linked Data, Pareto-optimal plans computed by our approach show benefits over suboptimal plans generated by existing solutions.

Concerning hybrid queries, different types of languages have been proposed. However, we note that there exists no standard hybrid query language for the more general graph-structured RDF data. We propose a full-text extension to SPARQL that extends Basic Graph Patterns (BGP) to Hybrid Graph Patterns (HGP) and thereby captures proprietary extensions employed by various RDF stores. We discuss the various types of hybrid search queries that can be supported with this model. Moreover, while there are many proposals for processing hybrid queries and ranking hybrid results, the problem of building indexes for efficiently supporting hybrid search is largely unexplored. We have identified two main directions of work. First, there are database extensions, which add keyword search support to databases by using a separate inverted index for textual data. The other direction is to build native indexes capturing both structured and textual data. We systematically study the differences among the various choices for native indexes and database extensions.
We propose a general hybrid search index schema, HybIdx, that can be used to specify access patterns needed by the various query types. We perform a comprehensive experiment using several benchmark datasets and queries to systematically study HybIdx in several scenarios, from the text-centric retrieval of documents in Wikipedia and TREC collections annotated with structured data to structure-centric retrieval of data in IMDB and YAGO up to “pure” hybrid data formed by combining Wikipedia and DBpedia. Compared to native approaches, HybIdx provides superior performance for relational and document queries (it outperforms the second-best approach by up to three orders of magnitude) and yields results close to the ones achieved by the best “focused” solution for entity queries. As opposed to these solutions, it is more complete regarding the types of hybrid search queries that can be supported.

While hybrid graph patterns make it easier for users to specify structured queries, knowledge of the structure in the data is still required. This structure information is useful, making up the difference between structured and keyword queries. However, users might be able to capture only some but not all of the structure information of a query. Addressing this, we propose to add to BGPs not only the use of keywords but also the capability to relax their structure, using Flexible Hybrid Graph Patterns (fHGP). The flexibility introduced by fHGP results in ambiguity. We show how an fHGP can be translated into a set of unambiguous HGPs. Then, based on the introduced semantics of HGP, these HGP-interpretations of an fHGP can be processed using the proposed index scheme HybIdx. Instead of producing all results, top-k processing, based for instance on the pull/bound rank join (PBRJ) template, can be used to restrict attention to the best results and to terminate early. Hence, processing fHGP interpretations can be cast as a multi-query processing problem.
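The early-termination idea behind top-k rank joins can be illustrated with a small single-query sketch in the spirit of the pull/bound template: tuples are pulled from two score-sorted inputs, join results are buffered, and a result is reported as soon as its score reaches an upper bound on any result not yet seen. This is only a simplified illustration under stated assumptions, not the Multi-Query PBRJ of this thesis: the function name `rank_join`, the alternating pull strategy, the sum as scoring function, and the example data are all assumptions made here.

```python
import heapq
from collections import defaultdict


def rank_join(left, right, k):
    """Top-k equi-join of two lists of (key, score) sorted by descending score.

    Assumptions of this sketch: the result score is the sum of both input
    scores, inputs are pulled alternately, and a corner bound is used.
    """
    if not left or not right:
        return []
    inputs = (left, right)
    seen = (defaultdict(list), defaultdict(list))  # key -> scores pulled so far
    pos = [0, 0]      # number of tuples pulled from each input
    heap = []         # buffered join results as (-score, key)
    results = []
    while len(results) < k:
        # Pull: alternate between inputs, skipping an exhausted one.
        side = 0 if pos[0] <= pos[1] else 1
        if pos[side] >= len(inputs[side]):
            side = 1 - side
        if pos[side] < len(inputs[side]):
            key, score = inputs[side][pos[side]]
            pos[side] += 1
            for other in seen[1 - side][key]:
                heapq.heappush(heap, (-(score + other), key))
            seen[side][key].append(score)
        # Bound: an unseen tuple of input s scores at most the last score
        # pulled from s and joins with at best the top tuple of the other input.
        bounds = []
        for s in (0, 1):
            if pos[s] < len(inputs[s]):
                last = inputs[s][max(pos[s] - 1, 0)][1]
                bounds.append(last + inputs[1 - s][0][1])
        threshold = max(bounds) if bounds else float("-inf")
        # Report buffered results that no unseen result can beat.
        while heap and len(results) < k and -heap[0][0] >= threshold:
            neg_score, key = heapq.heappop(heap)
            results.append((key, -neg_score))
        if not bounds:  # both inputs exhausted, nothing more to pull
            break
    return results


# Made-up inputs, sorted by descending score.
left = [("a", 9), ("b", 8), ("c", 1)]
right = [("b", 10), ("c", 9), ("a", 2)]
print(rank_join(left, right, 2))  # [('b', 18), ('a', 11)]
```

In this example the best result, ('b', 18), is reported after four of the six input tuples have been pulled, because the bound on all unseen results has dropped to 18 by then.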
The main technical contribution is the Multi-Query PBRJ. Compared to PBRJ, this extension processes several interpretations simultaneously to share their intermediate results. We introduce novel optimizations that are only possible with the Multi-Query PBRJ. With this, we show that run-time join order optimization is actually orthogonal to the top-k mechanisms, and propose the use of probing sequence selectors to achieve that. We propose score bounds specific to the interpretations that are tighter than the PBRJ bound obtained for the whole query (all interpretations). They enable more aggressive pulling and bounding, and hence earlier reporting of top-k results. Experiments show that sharing results of queries processed simultaneously is several (3-5) times faster than processing the queries one by one (without sharing). Further, the join order optimization and more aggressive interpretation-specific pulling/bounding lead to consistent improvements.

Acknowledgements

This thesis would not have been possible without the support and guidance of many people. First, I would like to thank my advisor Prof. Dr. Rudi Studer, who provided the opportunity and the support I needed for my research. I would also like to thank Dr. Duc Thanh Tran, my frequent co-author and advisor, who supported and motivated me during my work on this thesis.

Many thanks also go out to my current and former colleagues at AIFB, who provided an incredibly friendly and supportive atmosphere that I very much enjoyed working in. In particular, I would like to thank Daniel M. Herzig, Dr. Andreas Harth, Dr. Philipp Sorg, and Andreas Wagner, who all contributed in one way or another to my work and research. Prof. Dr. Philipp Cimiano first employed me as a student assistant during his time at AIFB, thereby introducing me to research work in the first place, for which I am also thankful.
Most of all, I am indebted to my family and friends for their support and encouragement, without which this thesis would not have been possible. I would like to thank Sarah for tolerating the many evenings and weekends spent in front of the computer and her loving support. I would not be where I am today without my parents Barbara and Helmut, whom I love very much. I dedicate this thesis to the memory of my mother who left us much too early.

Günter Ladwig

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Hybrid Data
  1.2 Querying Hybrid Data
  1.3 Query Processing
  1.4 Hypotheses
  1.5 Contribution of this Thesis
  1.6 Organization of this Thesis
2 Basics
  2.1 Data Model
  2.2 Unstructured Queries
    2.2.1 Query Model
    2.2.2 Challenges
  2.3 Structured Queries
    2.3.1 Linked Data
    2.3.2 Query Model
    2.3.3 Challenges
  2.4 Hybrid Queries
    2.4.1 Hybrid Query: HGP
    2.4.2 Flexible Hybrid Query: fHGP
    2.4.3 Challenges
  2.5 Query Compilation and Execution
    2.5.1 Overview
    2.5.2 Generating Physical Query Plans
    2.5.3 Optimization Algorithm
    2.5.4 Query Execution
    2.5.5 Adaptive Query Processing
3 Processing Unstructured Queries
  3.1 Introduction
  3.2 d-length 2-Hop Cover
    3.2.1 Construction
    3.2.2 Storage
  3.3 Keyword Query Processing
    3.3.1 Basic Join Operations
    3.3.2 Integrated Query Plan
    3.3.3 Top-k Keyword-Join Processing
  3.4 Related Work
  3.5 Evaluation
  3.6 Conclusion
4 Stream-based Linked Data Query Processing
  4.1 Introduction
    4.1.1 Source Discovery and Ranking
    4.1.2 Evaluation Strategies
    4.1.3 Remote and Local Linked Data Query Processing
  4.2 Overview
    4.2.1 Mixed Query Evaluation Strategy
    4.2.2 Architecture
  4.3 Linked Data Query Operators and Plans
    4.3.1 Linked Data Query Plans
    4.3.2 Symmetric Index Hash Join
  4.4 Query Planning and Optimization
    4.4.1 Source Ranking
    4.4.2 Estimating Cost and Cardinality of Plans
  4.5 Run-time Adaptation of Query Plans
    4.5.1 Run-time Source Discovery
    4.5.2 Run-time Refinement
  4.6 Related Work
  4.7 Evaluation
    4.7.1 Comparison of Evaluation Strategies
    4.7.2 Stream-based Linked Data Query Processing
  4.8 Conclusion