Research Projects

Extending Database Technology with Fundamentals of Text Analytics

Modern technological and social trends, such as mobile computing and social networking, result in an enormous amount of publicly available data with a high potential value within. Contemporary business models, such as cloud computing, open-source software and crowd sourcing, provide the means for analysis without the resources of grand enterprises. But that data have characteristics that challenge traditional database systems. Due to the uncontrolled nature by which data is produced, much of it is free text, often in informal natural language, leading to computing environments with high levels of uncertainty and error. Traditional database systems are based on models that fundamentally lack the ability to deeply process text and reason about uncertainty thereof; hence, existing solutions are often software bundles that combine databases, scripts for text extraction, NLP algorithms, and statistical libraries. The goal of this project is to establish foundations and implementations of data management systems that capture the nature of modern text analysis, to facilitate, expedite, and simplify application development.

Relevant Publications:

Advancing Frequent Subgraph Mining

The need to discover and generate frequent graph patterns arises in applications that span a wide spectrum, such as mining molecular data and classification of chemical com- pounds, behavior and link analysis in social networks, workflow analysis, and text classification. Several algorithms for generating (maximal) frequent subgraphs have been proposed and evaluated experimentally, but until recently not much has been known about the computational complexity of the fundamental related enumeration and decision problems. In a recent work, we conducted on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs, taking into account various key variables of the problem, such as possible restrictions on the classes of graphs and patterns, bounding the threshold, and bounding the number of desired answers. Within that investigation, we devised a novel algorithmic framework based on our previous work on hereditary graph properties.  This framework is inherently different from previous algorithms for mining maximal frequent subgraphs, as it is incremental in nature (i.e., commits to answers as it runs), and it provides efficiency guarantees (polynomial delay) in certain natural cases. The project aims to advance our understanding and the practical usability of mining maximal frequent subgraphs. We are especially interested in usability in the context of feature generation in machine learning; there, frequent patterns capture recurring patterns in structure, and in turn can be used as features.

Relevant Publications:

Query Suggestion in Complex Schemas

 This research aims to facilitate database querying in settings that involve large and complicated schemas (e.g., SAP schemas with thousands of relations). In this project we design and implement a system for automatic suggestion of database queries from keywords, natural text, and visual input from an interactive system. Such a system entails three main challenges. First, a nontrivial interactive interface is needed in order to allow users to express complicated schema relationships and understand system proposals. Second, the translation of vague phrases into valid queries (e.g., “Technion employee” into a join between Person, WorksFor, and Organization) requires nontrivial modeling that involves graphs, lexical databases and knowledge bases, and possibly machine learning and probabilistic modeling. Finally, all involved algorithms need to run in interactive response time.

Relevant Publications: