The focus of this CAREER project is on techniques and applications of derived data maintenance. Derived data is the result of applying some transformation, structural or computational, to base data. The use of derived data to facilitate access to base data is a recurring technique in many areas of computer science. Examples of derived data include caches, replicas, indexes, materialized views, and synopses. Whatever its form, purpose, complexity, and accuracy, derived data must be maintained when base data is updated. Thus, derived data maintenance is a fundamental problem in computer science. It is also an evolving problem: existing techniques are constantly challenged by the explosive growth in data volume and in the number of data producers and consumers, and by increasing diversity in data formats.
Traditionally, derived data maintenance has been tackled separately in different contexts, e.g., index updates and materialized view maintenance in databases, and cache coherence and replication protocols in distributed systems. Although they share the same underlying theme, these techniques have been developed and applied largely in isolation from one another. Newer and more complex data management tasks, however, call for creative combinations of these traditionally separate ideas. Semantic caching, which has recently received tremendous interest for its applications in caching dynamic Web content, is a good example of incorporating the idea of materialized views into a cache. With "outside-the-box" thinking such as semantic caching, we seek to discover more techniques that combine multiple flavors of derived data to provide better solutions to problems.
In the first year of this project, we have made progress on the following specific research problems: (1) caching for view maintenance; (2) caching for stream data processing; (3) caching for XML indexing; (4) incremental maintenance of XML structural indexes. A detailed description of our contributions can be found in the section on project impact.
We have made contributions to multiple application domains of derived data maintenance, including view maintenance, data warehousing, stream data processing, and XML indexing. A number of the contributions have been published in major database conferences: ICDE 2003, ICDE 2004, and SIGMOD 2004. Below is a detailed description of these contributions.
In terms of educational activities, we have incorporated current research topics into both undergraduate and graduate database courses at Duke University. The undergraduate database course offered in Fall 2003 introduced students to topics such as keyword search on relational databases, stream processing, and the Semantic Web. The graduate database course offered in Spring 2004 covered a substantial amount of material drawn from the latest research on XML; this course helped launch my group's entry into the XML research community. All course materials are published on the Web and available to the public.
To extend the impact of this project beyond computer science teaching and research, I am collaborating with David F. Kong and others at the Duke University Medical Center. We are in the process of initiating a project to develop an integrated dataspace for biomedical research at Duke University. This dataspace will provide an easy interface for biomedical researchers to access integrated genetic, genomic, and proteomic data as well as clinical databases. Results from the proposed research, specifically view maintenance and XML indexing techniques, will be applied in the development of this integrated dataspace to improve its efficiency and performance.
The overall goal of this project is to develop, over time, a collection of reusable, composable techniques for derived data maintenance with well-understood performance tradeoffs that can be readily used by applications. We believe creative combinations of synergistic techniques from different research fields hold real promise in providing efficient solutions to many new data management challenges. In return, we hope these ideas will contribute back to these fields and make an impact beyond the field of databases.
In the short term, we plan to continue investigating derived data maintenance techniques for XML, because the growing importance of XML creates an urgent practical need for such techniques. We are also considering the emerging application of network data querying. The initial study will focus on techniques for choosing nodes across the network to cache information from other nodes, so that a query over a remote region can be answered by contacting relatively few nodes closer to the querying node. The final solution would require combining techniques from caching, replica placement, and approximate replication. This work is being done in collaboration with Amin Vahdat, a networking/distributed systems researcher at the University of California, San Diego. Finally, we are working with David F. Kong and others at the Duke University Medical Center to develop an integrated dataspace for biomedical research, which will serve as a testbed for some of our results.