Querying Incomplete and Inconsistent Web Databases

Description
We are developing techniques for querying web databases that cope with both imprecise user queries and inconsistencies in the data. More Information about the Project


ExpertNet: Collaboration Network for Intelligent Social Computing

Description
We are developing computational foundations and quantitative frameworks to model, optimize, and search collaborative social networks to expedite problem-solving and innovation. More Information about ExpertNet


SWAN: Smart Workflow Management

Description
We are developing techniques for workflow management, including workflow modeling, provenance reasoning, workflow search, and optimization. The work covers both scientific workflows and business processes, and both regular and ad-hoc workflows. More Information about SWAN and its sub-project SmartFlow for managing ad-hoc workflows specifically.


Information Extraction -- A Database Centric Approach

Description
Traditionally, information extraction systems are implemented as a pipeline of special-purpose processing modules, so extraction must be re-applied from scratch to the entire text corpus whenever the data, processing modules, or extraction goals change. We propose an innovative paradigm for information extraction: the parse trees that natural language processing produces from textual documents are stored in a database, and extraction is then expressed as queries over that database using our proposed structured query language. Such a paradigm has several advantages:

Furthermore, to allow ordinary users to easily perform information extraction or keyword search on a corpus without learning the structured query language, we are investigating techniques that automatically generate structured queries from the user's keyword query and its pseudo-relevance feedback, with the goal of obtaining high-quality results.
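
The database-centric idea described above can be illustrated with a minimal sketch. It is not the project's query language or schema: plain SQL over an in-memory SQLite database stands in for the proposed structured query language, and the table layout (`node`), the example sentence, and the function names are all assumptions made for illustration. The point is only that once a parse tree is stored as rows, a new extraction goal becomes a new query rather than a new pipeline run.

```python
import sqlite3

# Hand-built parse tree for the sentence "Aspirin inhibits COX2".
# Hypothetical example data; each row is (id, parent, label, word),
# where word is None for internal (non-leaf) nodes.
NODES = [
    (1, None, "S", None),
    (2, 1, "NP", None),
    (3, 2, "NNP", "Aspirin"),
    (4, 1, "VP", None),
    (5, 4, "VBZ", "inhibits"),
    (6, 4, "NP", None),
    (7, 6, "NNP", "COX2"),
]


def build_db():
    """Store the parse tree in a single relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node ("
        "id INTEGER PRIMARY KEY, parent INTEGER, label TEXT, word TEXT)"
    )
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", NODES)
    return conn


# Extraction goal expressed as a query: (subject, verb, object) triples
# for sentences shaped like S -> NP VP, VP -> VB* NP.
EXTRACT_SQL = """
SELECT subj_w.word, verb.word, obj_w.word
FROM node AS s
JOIN node AS subj   ON subj.parent = s.id    AND subj.label = 'NP'
JOIN node AS subj_w ON subj_w.parent = subj.id
JOIN node AS vp     ON vp.parent = s.id      AND vp.label = 'VP'
JOIN node AS verb   ON verb.parent = vp.id   AND verb.label LIKE 'VB%'
JOIN node AS obj    ON obj.parent = vp.id    AND obj.label = 'NP'
JOIN node AS obj_w  ON obj_w.parent = obj.id
WHERE s.label = 'S'
"""


def extract_triples(conn):
    """Run the extraction query and return the matching rows."""
    return conn.execute(EXTRACT_SQL).fetchall()


if __name__ == "__main__":
    print(extract_triples(build_db()))
```

Changing the extraction goal here means editing `EXTRACT_SQL` only; the stored parse trees are reused as they are.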

Publications
TKDE'12, ICDE'10 (demo), ICDE'06

People

Prof. Chitta Baral  Prof. Graciela Gonzalez  Prof. Steven Bird  Prof. Susan B. Davidson  Haejoong Lee  Yifeng Zheng 


XML Stream Processing

Description
There are many applications where data arrives continuously as a stream and requires on-line processing without being loaded into a database, for example, real-time monitoring of traffic or financial information. We focus on efficient techniques for processing XML streams. The topics that we have studied include how to efficiently evaluate XPath queries on XML streams, how to validate XML streams against user-specified constraints, and how to encode the data so that the encoded XML streams can be processed faster.
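
As a rough illustration of evaluating a path query on a stream, the sketch below matches a simple absolute child-axis path (such as `/feed/quote/price`) while parsing incrementally with the standard library's `xml.etree.ElementTree.iterparse`. It is not the project's algorithm: it handles no predicates, wildcards, or descendant axes, and the function name `stream_match` and the sample feed are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def stream_match(source, path):
    """Yield the text of each element whose root-to-node tag path equals `path`.

    Only a stack of open tags is kept as matching state, so a match is
    decided as soon as the element's end tag is seen.
    """
    target = path.strip("/").split("/")
    stack = []  # tags of the currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == target:
                yield elem.text
            stack.pop()
            # Drop the element's content once it has been processed.
            # Emptied child elements still remain attached to their parent.
            elem.clear()


if __name__ == "__main__":
    feed = io.BytesIO(
        b"<feed>"
        b"<quote><sym>A</sym><price>10</price></quote>"
        b"<quote><sym>B</sym><price>20</price></quote>"
        b"</feed>"
    )
    for price in stream_match(feed, "/feed/quote/price"):
        print(price)
```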

Publications

People
Prof. Susan Davidson   Dr. George Mihaila   Dr. Sriram Padmanabhan  Yi Chen Yifeng Zheng


XML Databases

Description
Because XML has become a popular format for data representation, effective storage and efficient query processing of XML data are very important. Relational databases, meanwhile, have been optimized for performance through more than 30 years of development and are highly reliable, scalable, and well established as the backend for data storage. We have developed storage and query evaluation techniques for XML data by leveraging relational database technology.

In transforming the hierarchical structure of XML data into relational tables, we addressed two challenges. First, how should the transformation be designed so that the SQL queries generated from XML queries are efficient? Second, when the schema of the XML data is available, how should a normalized relational schema be designed for data storage to ensure data correctness and avoid update anomalies?
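
To make the XML-to-relational transformation concrete, here is a minimal sketch of one generic mapping: every element becomes a row in a single "edge" table, and a child-axis path is translated into a chain of self-joins. This is a textbook-style baseline, not the storage scheme developed in this project, and it deliberately ignores the efficiency and normalization questions raised above. The table name `edge` and the functions `shred` and `path_to_sql` are assumptions for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text):
    """Store each XML element as one row: (id, parent, tag, text)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge ("
        "id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)"
    )
    counter = 0

    def visit(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter  # ids follow document order
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?)",
            (node_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return conn


def path_to_sql(path):
    """Translate a child-axis path like /lib/book/title into SQL self-joins.

    Returns (sql, params); params follow the textual order of the
    placeholders, so the JOIN tags come first and the root tag last.
    """
    steps = path.strip("/").split("/")
    last = len(steps) - 1
    joins = ["edge AS e0"]
    for i in range(1, len(steps)):
        joins.append(
            f"JOIN edge AS e{i} ON e{i}.parent = e{i - 1}.id AND e{i}.tag = ?"
        )
    sql = (
        f"SELECT e{last}.text FROM " + " ".join(joins)
        + " WHERE e0.parent IS NULL AND e0.tag = ?"
        + f" ORDER BY e{last}.id"
    )
    return sql, steps[1:] + steps[:1]


if __name__ == "__main__":
    conn = shred(
        "<lib><book><title>A</title></book><book><title>B</title></book></lib>"
    )
    sql, params = path_to_sql("/lib/book/title")
    print([row[0] for row in conn.execute(sql, params)])
```

The generated query needs one join per path step, which hints at why the design of the mapping matters for query efficiency.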

Publications


XML Constraints

Description
We have studied various constraints on XML data, including keys, foreign keys, and functional dependencies. We investigated how to validate XML constraints when the XML data is in its native form, as a file or a stream, or is stored in relational databases. The constraints can also be enforced incrementally when updates are made to the XML data. Furthermore, we have studied how to use the constraint information to guide schema design, ensuring data correctness and removing redundancy when the data is stored in relational databases.
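
A simple case of key validation can be sketched as follows, assuming a document-wide key of the form "the `attr` attribute of every `tag` element is unique". Real XML keys can be relative to a context and defined over paths; this sketch covers neither, and the names `KeyIndex` and `validate_key` are hypothetical. The same index that is built during validation can then be used to check later insertions incrementally, without rescanning the document.

```python
import io
import xml.etree.ElementTree as ET


class KeyIndex:
    """Set of key values seen so far; supports incremental checks."""

    def __init__(self):
        self.seen = set()

    def add(self, value):
        """Record `value`; return False if it would violate the key."""
        if value in self.seen:
            return False
        self.seen.add(value)
        return True


def validate_key(source, tag, attr):
    """Check, in one streaming pass, that `attr` is a key for `tag` elements.

    Returns (index, violations), where violations lists the duplicate
    or missing (None) key values in document order.
    """
    index = KeyIndex()
    violations = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            value = elem.get(attr)
            if value is None or not index.add(value):
                violations.append(value)
        elem.clear()  # the element is no longer needed once checked
    return index, violations


if __name__ == "__main__":
    doc = io.BytesIO(
        b'<lib><book isbn="1"/><book isbn="2"/><book isbn="1"/></lib>'
    )
    index, violations = validate_key(doc, "book", "isbn")
    print(violations)      # duplicate key values found in the document
    print(index.add("3"))  # incremental check for a newly inserted book
```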

Publications

People
Prof. Susan Davidson  Prof. Carmem Hara Yi Chen Yifeng Zheng

Querying Linguistic Databases

Description
Describing and analyzing human languages depends on being able to manage large databases of annotated text and recorded speech. This project applies research in relational and XML databases to linguistics, develops linguistic data models and query languages, and deploys them for creating, managing, analyzing, and displaying annotated linguistic databases.
Project web page:  http://www.ldc.upenn.edu/Projects/QLDB/

Publications

People
Prof. Steven Bird  Prof. Susan Davidson   Prof. Mark Liberman   Dr. Beatrice Santorini     Yi Chen  Baden Hughes   Catherine Lai   Haejoong Lee  Yifeng Zheng