Extending Database Technology with Fundamentals of Text Analytics
Modern technological and social trends, such as mobile computing and social networking, result in an enormous amount of publicly available data with a high potential value within. Contemporary business models, such as cloud computing, open-source software and crowd sourcing, provide the means for analysis without the resources of grand enterprises. But that data have characteristics that challenge traditional database systems. Due to the uncontrolled nature by which data is produced, much of it is free text, often in informal natural language, leading to computing environments with high levels of uncertainty and error. Traditional database systems are based on models that fundamentally lack the ability to deeply process text and reason about uncertainty thereof; hence, existing solutions are often software bundles that combine databases, scripts for text extraction, NLP algorithms, and statistical libraries. The goal of this project is to establish foundations and implementations of data management systems that capture the nature of modern text analysis, to facilitate, expedite, and simplify application development.
- Benny Kimelfeld: Database principles in information extraction. PODS 2014: 156-163
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, Stijn Vansummeren: Cleaning inconsistencies in information extraction via prioritized repairs. PODS 2014: 164-175
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, Stijn Vansummeren: Spanners: a formal framework for information extraction. PODS 2013: 37-48
Advancing Frequent Subgraph Mining
The need to discover and generate frequent graph patterns arises in applications that span a wide spectrum, such as mining molecular data and classification of chemical com- pounds, behavior and link analysis in social networks, workflow analysis, and text classification. Several algorithms for generating (maximal) frequent subgraphs have been proposed and evaluated experimentally, but until recently not much has been known about the computational complexity of the fundamental related enumeration and decision problems. In a recent work, we conducted on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs, taking into account various key variables of the problem, such as possible restrictions on the classes of graphs and patterns, bounding the threshold, and bounding the number of desired answers. Within that investigation, we devised a novel algorithmic framework based on our previous work on hereditary graph properties. This framework is inherently different from previous algorithms for mining maximal frequent subgraphs, as it is incremental in nature (i.e., commits to answers as it runs), and it provides efficiency guarantees (polynomial delay) in certain natural cases. The project aims to advance our understanding and the practical usability of mining maximal frequent subgraphs. We are especially interested in usability in the context of feature generation in machine learning; there, frequent patterns capture recurring patterns in structure, and in turn can be used as features.
- Benny Kimelfeld, Phokion G. Kolaitis: The Complexity of Mining Maximal Frequent Subgraphs. ACM Trans. Database Syst. 39(4): 32 (2014)
- Conference version: Benny Kimelfeld, Phokion G. Kolaitis: The complexity of mining maximal frequent subgraphs. PODS 2013: 13-24
Sara Cohen, Benny Kimelfeld, Yehoshua Sagiv: Generating all maximal induced subgraphs for hereditary and connected-hereditary graph properties. J. Comput. Syst. Sci. 74(7): 1147-1159 (2008)
Query Suggestion in Complex Schemas
- Benny Kimelfeld, Yehoshua Sagiv: Extracting minimum-weight tree patterns from a schema with neighborhood constraints. ICDT 2013: 249-260
- Benny Kimelfeld, Yehoshua Sagiv: Finding a minimal tree pattern under neighborhood constraints. PODS 2011: 235-246
- Ronald Fagin, Benny Kimelfeld, Yunyao Li, Sriram Raghavan, Shivakumar Vaithyanathan: Understanding queries in a search database system. PODS 2010: 273-284
- Ronald Fagin, Benny Kimelfeld, Yunyao Li, Sriram Raghavan, Shivakumar Vaithyanathan: Rewrite rules for search database systems. PODS 2011: 271-282
Konstantin Golenberg, Benny Kimelfeld, Yehoshua Sagiv: Keyword proximity search in complex data graphs. SIGMOD Conference 2008: 927-940