CAREER: Techniques and Applications of Derived Data Maintenance

Project Award Number: IIS-0238386

Principal Investigator

Jun Yang
Department of Computer Science
Duke University
P. O. Box 90129
Durham, North Carolina 27708

Phone: 919-660-6587
Fax: 919-660-6519
Email: junyang@cs.duke.edu
Web: http://www.cs.duke.edu/~junyang/

Collaborator

Yuguo Chen
Institute of Statistics and Decision Science
Duke University
216 Old Chemistry Building
Durham, North Carolina 27708

Phone: 919-681-8443
Fax: 919-684-8594
Email: yuguo@stat.duke.edu
Web: http://www.isds.duke.edu/~yuguo/

Collaborator

Amin Vahdat
Department of Computer Science and Engineering
University of California at San Diego
9500 Gilman Drive, Dept. 0114
La Jolla, California 92093

Phone: 858-534-4614
Fax: 858-534-7029
Email: vahdat@cs.ucsd.edu
Web: http://www.cs.ucsd.edu/~vahdat/

Collaborator

David F. Kong
Division of Cardiology, Department of Medicine
Duke University Medical Center
2300-2399 Erwin Road
Durham, North Carolina 27710

Phone: 919-668-8946
Fax: 919-668-7058
Email: dkong@acpub.duke.edu
Web: http://www.duke.cardiologydomain.com/handler.cfm?event=practice,physicians,detail&physicianID=963D94A4-07B8-42B5-849E6948B4E65100

Collaborator

Ioana R. Stanoi
IBM T. J. Watson Research Center
Hawthorne 1
30 Saw Mill River Road
Hawthorne, New York 10532

Phone: 914-784-7028
Email: irs@us.ibm.com
Web: http://www.research.ibm.com/people/i/irs/

Keywords

derived data
materialized views
indexing
caching
incremental maintenance
XML

Project Summary

The focus of this CAREER project is on techniques and applications of derived data maintenance. Derived data is the result of applying some transformation, structural or computational, to base data. The use of derived data to facilitate access to base data is a recurring technique in many areas of computer science. Examples of derived data include caches, replicas, indexes, materialized views, synopses, etc. Regardless of the varying forms, purposes, complexity, and accuracy of derived data, it must be maintained when base data is updated. Thus, derived data maintenance is a fundamental problem in computer science. It is also an evolving problem: existing techniques are constantly challenged by the explosive growth in data volume and number of data producers and consumers, and by increasing diversity in data formats.

Traditionally, derived data maintenance has been tackled separately in different contexts, e.g., index updates and materialized view maintenance in databases, cache coherence and replication protocols in distributed systems. Although they share the same underlying theme, these techniques have been developed and applied largely disjointly. Newer and more complex data management tasks, however, call for creative combinations of the traditionally separate ideas. Semantic caching, which has received tremendous interests recently for its applications in caching dynamic Web contents, is a good example of incorporating the idea of materialized views into a cache. With "outside-the-box" thinking such as semantic caching, we seek to discover more techniques that combine multiple flavors of derived data to provide better solutions to problems.

In the first year of this project, we have made progress on the following specific research problems: (1) caching for view maintenance; (2) caching for stream data processing; (3) caching for XML indexing; (4) incremental maintenance of XML structural indexes. A detailed description of our contributions can be found in the section on project impact.

Publications and Products

  • Ke Yi, Hai Yu, Jun Yang, Gangqiang Xia, and Yuguo Chen. "Efficient Maintenance of Materialized Top-k Views." In Proceedings of the 19th International Conference on Data Engineering (ICDE '03), Bangalore, India, March 2003.
  • Zhihui Wang. "Multiple-View Maintenance with Semantic Caching." M.S. Thesis, Duke University, August 2003.
  • Junyi Xie, Jun Yang, and Yuguo Chen. "On Joining and Caching Stochastic Streams." Technical Report, Department of Computer Science, Duke University, November 2003.
  • Adam Silberstein and Jun Yang. "NeXSort: Sorting XML in External Memory." In Proceedings of the 20th International Conference on Data Engineering (ICDE '04), Boston, Massachusetts, March 2004.
  • Hao He and Jun Yang. "Multiresolution Indexing of XML for Frequent Queries." In Proceedings of the 20th International Conference on Data Engineering (ICDE '04), Boston, Massachusetts, March 2004.
  • Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. "Incremental Maintenance of XML Structural Indexes." In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD '04), Paris, France, June 2004.

Project Impact

We have made contributions to multiple application domains of derived data maintenance, including view maintenance, data warehousing, stream data processing, and XML indexing. A number of the contributions have been published in major database conferences: ICDE 2003, ICDE 2004, and SIGMOD 2004. Below is a detailed description of these contributions.

  1. Caching for view maintenance. The traditional approach to making a view self-maintainable is to store enough auxiliary data so that maintenance never needs to access base data. This rigid notion often requires an enormous and potentially unbounded amount of auxiliary data. Our work on top-k view maintenance (published in ICDE 2003) has demonstrated a much more cost-effective approach of high-probability runtime self-maintenance, an idea inspired by caching. Chosen appropriately, a small amount of auxiliary data will satisfy the needs of most maintenance operations; occasional accesses to base data may still be needed, but the amortized maintenance cost is much lower than both complete self-maintenance and no self-maintenance at all. We have extended this idea to other views such as joins, and developed an auxiliary data cache for data warehousing. We have studied several cache organizations and replacement policies, and found that the most effective and flexible strategy is to allow the auxiliary data cache to use a different replacement policy for each instance of a base table in each view---a conclusion similar to Chou and DeWitt's work on buffer management strategies in VLDB 1985. Our work led to a Master's thesis (2003).
  2. Caching for stream data processing. Another topic we have investigated is operator state management in stream data processing. Answering continuous queries over streams can be seen as incrementally maintaining views given update streams without accessing base data. With limited memory, choosing what to keep in the state to maximize result completeness is analogous to making replacement decisions in an auxiliary data cache. Our recent work on joining and caching stochastic streams (technical report; submitted for publication) elucidates the difference and relationship between join state management for streams and the classic caching problem. Their difference implies that optimal classic caching policies do not carry over directly to joining streams. Despite the difference, we show the two problems can be tackled under a unified framework, and we develop techniques for deriving provably optimal replacement policies in some cases and good heuristics in others. Recognizing the importance of update constraints and limitation of hard constraints, we exploit statistical properties of streams---soft, probabilistic constraints---to optimize replacement decisions. It is tempting to give heuristic solutions without understanding when and why they work (or fail). We strive to follow a principled approach to validating solutions both experimentally and analytically. My collaboration with Yuguo Chen, a colleague at Duke statistics, has helped bring much rigor to the analysis of solutions.
  3. Caching for XML indexing. We have developed the M*(k)-index (published in ICDE 2004), an adaptive, multiresolution structural index for XML. A high-resolution index can support long path expression queries accurately, but it has a large size that can adversely affect query performance, especially for short path expressions. Our index can be adaptively and incrementally refined to support frequent queries, similar to a cache in spirit. It is multiresolution: not only can it index different parts of the XML data graph at different resolutions, but it also indexes the same part of the graph at multiple resolutions. This feature provides dramatic performance improvements over previous structural indexes. An understandable concern about the M*(k)-index is its size, since it maintains more information. Interestingly, it turns out to be more space-efficient than other incrementally refined indexes, because the extra information also helps with index refinement.
  4. Incremental maintenance of XML structural indexes. In our recent work on incremental maintenance of XML structural indexes (published in SIGMOD 2004), we show that this multiresolution approach also helps index maintenance with respect to base data updates. In some sense, the extra information plays an analogous role as auxiliary data in view maintenance. Previous algorithms for maintaining structural indexes provide no guarantees on index quality; as more updates are applied, they produce progressively larger indexes with degrading query performance. Our algorithms are efficient and always preserve the minimality of the structural indexes.

In terms of educational activities, we have incorporated current research topics into both undergraduate and graduate database courses at Duke University. The undergraduate database course offered in Fall 2003 introduced students to topics such as keyword search on relational databases, stream processing, and semantic Web. The graduate database course offered in Spring 2004 covered a substantial amount of material drawn from the latest research on XML; this course helped catapult my group's entry into the XML research community. All course materials are published on the Web and available to the public.

To extend the impact of this project beyond computer science teaching and research, I am collaborating with David F. Kong and others at the Duke University Medical Center. We are in the process of initiating a project to develop an integrated dataspace for biomedical research at Duke University. This dataspace will provide an easy interface for biomedical researchers to access integrated genetic, genomic, and proteomic data as well as clinical databases. Results from the proposed research, specifically view maintenance and XML indexing techniques, will be applied in the development of this integrated dataspace to improve its efficiency and performance.

Goals, Objectives and Targeted Activities

The overall goal of this project is to develop, over time, a collection of reusable, composable techniques for derived data maintenance with well-understood performance tradeoffs that can be readily used by applications. We believe creative combinations of synergetic techniques from different research fields hold real promise in providing efficient solutions to many new data management challenges. In return, we hope these ideas will contribute back to these fields and make impact beyond the field of databases.

In the short term, we plan to continue investigation of derived data maintenance techniques for XML because of the urgent practical need for such techniques with the growing importance of XML. We are also considering the emerging application of network data querying. The initial study will focus on techniques for choosing nodes across the network to cache information from other nodes, in a way such that a query over a remote region can be answered by contacting relatively few nodes closer to the querying node. The final solution would require combining techniques from caching, replica placement, and approximate replication. This work is being done in collaboration with Amin Vahdat, a networking/distributed systems researcher at University of California at San Diego. Finally, we are working with David F. Kong and others at the Duke University Medical Center to develop an integrated dataspace for biomedical research, which will serve a testbed for some of our results.

Area Background

Traditionally, derived data has been not an area by itself; instead, it has been investigated (largely independently) by many different research communities, e.g., caching, replication, indexing, view materialization, data warehousing, and data reduction. Since our research seeks to combine techniques developed for many different types of derived data, there is a large body of related work spanning many traditional research areas of computer science. Furthermore, our work is also related to general work in new areas where derived data can be applied, such as continuous query processing and XML data management.

Area References

As discussed in the section on area background, because of the nature of this research, our work is related to a large number of different research areas, which are difficult to summarize in a few references. A number of representative books (by no means a comprehensive list) include Materialized Views: Techniques, Implementations, and Applications, by Ashish Gupta and Inderpal Singh Mumick; Data on the Web: from relations to semistructured data and XML, by Serge Abiteboul, Peter Buneman, and Dan Suciu; Web Caching and Replication, by Michael Rabinovich and Oliver Spatscheck.

Potential Related Projects

Since we investigate applications of derived data maintenance techniques in XML indexing and stream/continuous query processing, our project is generally related to a number of research projects on XML and streams, e.g., view-design algorithms (NC State), TIMBER (U Michigan), STREAM (Stanford), XStreamCast (UT Arlington), etc. We welcome contacts from other researchers interested in collaborations.

Project Websites

http://www.cs.duke.edu/dbgroup/ddm/