Project Description

Background and state-of-the-art

We consider online communities in the general sense of sharing common interests or purposes through data. Community members can be private users volunteering their time for a common project. They can also be professionals (researchers, engineers, support staff, etc.) who use web-scale collaboration in their workplace within or across their organizations. Therefore, the advantages of mass collaboration such as faster production and better accuracy of knowledge and data can be brought to all kinds of companies and, for instance, help them create better services and products faster at lower cost. To illustrate the data management requirements in this context, let us introduce two representative, rather complex examples of online community applications: collaborative medical research and social networking systems.

  Collaborative medical research. Medical research is a highly collaborative process, involving multiple organizations (research laboratories, hospitals, pharmaceutical companies, etc.) and multiple participants with different levels of expertise (patient volunteers, medical scientists, biologists, pharmacists, physicians, pathologists, surgeons, etc.). As an example of a complex collaboration scenario, an extensive study of the course of a disease over a number of years would require integrating data about a population of selected patients (with their diagnoses, family histories, therapies, etc.) and matching them with other data (pathology data, oncology data, etc.). Other scenarios may involve simpler, shorter collaboration, e.g. a group of pathologists trying to diagnose a given patient.

  Vast amounts of medical data are produced continuously with various levels of accessibility (patient records, drug studies, epidemiologic studies, genomic sequences, etc.). Although most of the data is stored in medical information systems, these systems are isolated, do not interoperate and do not provide support for collaboration. Thus, participants typically collaborate by copying and directly exchanging data as files (e.g. spreadsheets), which makes data integration and data analysis time-consuming (e.g. when a large number of files from different participants is needed) and error-prone (e.g. when merging data from spreadsheets of different formats). Furthermore, data copying may violate patient privacy regulations.

    A good analysis of the general requirements of a Computer Supported Cooperative Work (CSCW) system for medical research collaboration is given in [Sta+08]. From the point of view of data management, we can derive the following requirements: transparent data access with query capabilities (join, transformation), update support for shared data (e.g. adding annotations to experimental results, relating data sources), support for dynamic groups of participants, and data privacy with role management (different participants have different roles and thus different access rights on the data). Furthermore, data uncertainty must be supported, e.g. to deal with observations made with tools of different accuracy.

    Some of these requirements (except data uncertainty and data privacy) can be addressed by building a CSCW system on top of an existing database, as proposed in [Sta+08]. But this solution has the traditional drawbacks of centralized systems: single point of failure, heavy administration of global information, and latency for remote users. Furthermore, it is ill-suited to letting dynamic groups form quickly in order to perform fast, short-lived collaboration (e.g. when a patient's life is at stake). Finally, data privacy may be compromised as a result of copying sensitive data into the central database.

    We claim that a P2P solution is the right one in this case as it is lightweight in terms of administration and can scale up easily. Peers can be the participants or organizations involved in the collaboration and may keep full control over their data. Furthermore, data replication can be exploited to increase data availability and foster parallel work.

    Social networks. A social network such as Facebook enables its users to share personal information stored in a central repository. It is rather straightforward to develop new applications using a simple API. However, these applications are rather limited in scope. We claim that there are two fundamental flaws in this setting. The first is that it is technically inefficient to centralize all the data and the control in a system that is bound to become a bottleneck or end up wasting enormous resources (the Facebook farm). More importantly, many users are reluctant to give full control over their private data to a provider (Facebook) who can sell it to other businesses and, worse, leave such control to third parties of unknown affiliations.

    In this case again, a P2P solution where peers are social participants is promising for data management, as it gives owners better control over the privacy of their personal data (kept in some proxy database). A user can then interact with the system through an interface in the style of mashups. A proxy handles her data and the interaction with the community. Note that with such an approach, a user can have her own data (e.g., phone number, list of trusted friends) shared between many systems (Myspace, GoogleMail, Flickr, etc.) rather than replicated and inconsistent on the private servers of these systems. Finally, more advanced data management capabilities (e.g. queries, replication, etc.) could be provided and significantly increase the scope of social networks (e.g. enable large-scale collaboration of social participants).

Summary of Requirements
Like many other online community applications, these two applications have common requirements (e.g. high-level data access, data privacy) and differences. For instance, collaborative medical research may be quite demanding in terms of the quantity of data exchanged while social networks may involve very high numbers of participants. A P2P architecture provides important advantages such as decentralized control and administration, scale-up to high numbers of peers and support for the dynamic behaviour of peers (who may join or leave the system at will). These advantages are important for online communities. In addition, we have the following requirements for data management:

  •     Data uncertainty. Some data should not be assumed to be 100% certain, precise or correct, in particular when it comes from peers with different confidence levels. Data uncertainty should be supported at all levels of data management: schema management, semantic data descriptions, query processing, replication and privacy.
  •     Semantic data integration. Users should be able to access a set of data sources using their own semantic descriptions (e.g. ontologies) or annotations. For this purpose, the system should provide a mapping discovery service that uses an automatic and incremental process. This process should be self configurable and efficient.
  •     Query expressiveness. The query language should allow users to describe the desired data at the appropriate level of detail. For structured data, an SQL-like query capability is necessary. It should provide the ability to rank results and deal with uncertainty. Keyword search, as in search engines, can also be provided on top of an SQL-like query facility for simple queries [CHZ05].
  •     Update, change control, replication. Data should be replicated to improve availability despite peers’ failures and to improve performance of mass collaboration. Since the data can be updated in parallel by different peers, data reconciliation must be supported. The management and the surveillance of changes are also major challenges.
  •     Data privacy and trust. P2P data sharing systems pay little attention to data privacy and trust among participants. This has been exploited to avoid centralized control (and violate copyright law). But for collaboration among professionals with sensitive data (as in collaborative medical research), data privacy and trust among participants are major requirements.

State of the Art

The state of the art relevant to the DataRing project relates to recent extensions of database systems, data integration systems, P2P data sharing systems and the semantic web.

  Database systems. For a long time, the research agenda of the database research community has been to provide advanced database system capabilities for emerging applications of information systems. Some recent work in database systems is related to DataRing: support of top-k queries, support of data uncertainty and data privacy.

    Top-k queries enable users to rank their results based on a scoring function as in search engines but on structured data (e.g. with SQL syntax). The first important work is [FLN03] which models the general problem of answering top-k queries using lists of data items sorted by their local scores and proposes a simple, yet efficient algorithm, Fagin's algorithm (FA). A better algorithm over sorted lists is the Threshold Algorithm (TA) [FLN03]. TA has been the basis for many extensions in distributed database systems. Recently, Best Position Algorithms (BPA) [APV07a] demonstrated significant and consistent performance improvement over TA. We plan to capitalize on this work in DataRing to support more general forms of flexible querying in a P2P environment.
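
    To make the sorted-list model concrete, the following is a minimal sketch of TA, not the implementation of [FLN03]. It assumes a sum as the (monotone) scoring function, in-memory lists, and that every item appears in every list; the function name and the sample scores are hypothetical.

```python
import heapq


def threshold_algorithm(lists, k):
    """Return (top-k results, number of positions read per list).

    `lists` holds m lists of (item, local_score) pairs, each sorted by
    descending local score. Assumption of this sketch: every item occurs
    in every list, and the overall score is the sum of local scores.
    """
    # Random-access index: one dict per list, item -> local score.
    lookup = [dict(lst) for lst in lists]
    seen = set()
    top = []  # min-heap of (overall_score, item), at most k entries
    depth_read = 0
    for pos in range(min(len(lst) for lst in lists)):
        depth_read = pos + 1
        # Sorted access: read the entry at the current position of each list.
        for lst in lists:
            item = lst[pos][0]
            if item in seen:
                continue
            seen.add(item)
            # Random access: fetch this item's local score in every list.
            overall = sum(table[item] for table in lookup)
            if len(top) < k:
                heapq.heappush(top, (overall, item))
            elif overall > top[0][0]:
                heapq.heapreplace(top, (overall, item))
        # Threshold: the best overall score an unseen item could still reach.
        threshold = sum(lst[pos][1] for lst in lists)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), depth_read


# Hypothetical local scores for four items in two sorted lists.
list1 = [("a", 9), ("b", 8), ("c", 3), ("d", 1)]
list2 = [("b", 9), ("a", 7), ("d", 4), ("c", 2)]
best, depth = threshold_algorithm([list1, list2], k=1)
```

    On this sample, the scan stops after two positions per list instead of four, because the threshold (8 + 7 = 15) has dropped below the best overall score found so far (17 for item b).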

    Data uncertainty in DBMS has recently received attention in order to deal with data extracted from data sources of various qualities (e.g. scientific data, commercial data). An important project is Trio-One at Stanford [ABD+06], which aims at providing data uncertainty and lineage in an integrated manner in a DBMS. This is done by extending the relational model and SQL with several constructs, in particular numeric confidence values, optionally attached to tuples. Confidence values represent the degree of certainty and follow a probabilistic interpretation as in probabilistic databases [DS05], i.e. the certainty about the correctness of data is the probability that the data is correct. Trio-One is built on top of a relational DBMS using data and query translation techniques and stored procedures. Another important approach to dealing with imprecise data is fuzzy logic, where data values range over a user-defined vocabulary. It has been used successfully to build user-oriented database summaries [SRM05]. Probabilistic and fuzzy databases are two complementary approaches which we plan to explore in DataRing, but in a different context (P2P).
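
    As an illustration of the probabilistic interpretation of confidence values, the sketch below attaches a confidence to each tuple and propagates it through a join. It is not how Trio-One computes confidences (which relies on lineage); it assumes that the joined tuples are independent, so that their confidences multiply. The relation layout, the function name and the sample data are hypothetical.

```python
from fractions import Fraction


def join(left, right, key):
    """Join two uncertain relations on `key`.

    Each relation is a list of (row, confidence) pairs, where `row` is a
    dict and `confidence` is the probability that the row is correct.
    Assumption of this sketch: rows are independent, so the confidence of
    a joined row is the product of the confidences of its inputs.
    """
    result = []
    for left_row, left_conf in left:
        for right_row, right_conf in right:
            if left_row[key] == right_row[key]:
                result.append(({**left_row, **right_row}, left_conf * right_conf))
    return result


# Hypothetical observations made with tools of different accuracy.
diagnoses = [
    ({"patient": "p1", "diagnosis": "melanoma"}, Fraction(9, 10)),
    ({"patient": "p2", "diagnosis": "melanoma"}, Fraction(3, 5)),
]
biopsies = [
    ({"patient": "p1", "biopsy": "positive"}, Fraction(1, 2)),
    ({"patient": "p2", "biopsy": "positive"}, Fraction(1, 1)),
]
joined = join(diagnoses, biopsies, "patient")
```

    Exact fractions are used only to keep the arithmetic easy to check: the joined row for p1 has confidence 9/10 × 1/2 = 9/20, lower than either input.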

     As data about individuals and organizations can be easily disclosed and collected on the web, data privacy is becoming a major issue. A basic principle of data privacy is purpose specification which states that data providers should be able to specify the purpose for which their data will be collected and used. Hippocratic databases provide mechanisms for enforcing purpose-based disclosure control within a database [AKS+02]. This is achieved by using privacy metadata, i.e. privacy policies and privacy authorizations stored in relational tables. In the context of P2P systems, decentralized control makes it hard to enforce purpose-based privacy which remains an open problem.
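
    The idea of purpose-based disclosure control can be sketched as follows. This is only an illustration under our own assumptions, not the design of [AKS+02]: the privacy metadata are held in two in-memory tables, and all attribute names, roles and purposes are hypothetical.

```python
# Privacy policy (hypothetical): attribute -> purposes the data provider agreed to.
POLICY = {
    "name": {"treatment"},
    "diagnosis": {"treatment", "research"},
    "phone": set(),
}

# Privacy authorizations (hypothetical): purpose -> roles allowed to invoke it.
AUTHORIZATIONS = {
    "treatment": {"physician"},
    "research": {"physician", "researcher"},
}


def disclose(record, role, purpose):
    """Return only the attributes of `record` that may be disclosed.

    A request states the role of the requester and the purpose of the
    access. Nothing is disclosed if the role is not authorized for the
    purpose; otherwise only attributes whose policy lists the purpose
    are returned.
    """
    if role not in AUTHORIZATIONS.get(purpose, set()):
        return {}
    return {
        attribute: value
        for attribute, value in record.items()
        if purpose in POLICY.get(attribute, set())
    }


record = {"name": "Jane Doe", "diagnosis": "melanoma", "phone": "555-0100"}
for_research = disclose(record, role="researcher", purpose="research")
```

    Here a researcher obtains the diagnosis but neither the name nor the phone number. In a P2P setting there is no single place to store and check such tables, which is precisely the difficulty noted above.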

     Data integration systems. Data management in distributed systems has traditionally been achieved by distributed database systems [ÖV99] which enable users to transparently access and update multiple databases in a network using a high-level query language (e.g. SQL). Transparency is achieved through a global schema which hides the local databases’ heterogeneity. In its simplest form, a distributed database system is a centralized server that supports a global schema and implements distributed database techniques (query processing, transaction management, consistency management, etc.). This approach has proved effective for applications that can benefit from centralized control and full-fledged database capabilities, e.g. information systems. However, it cannot scale up to more than tens of databases. Data integration systems, e.g. DISCO [TRV98], extend the distributed database approach to access autonomous data sources (such as files, databases, documents, etc.) on the web with a simpler query language in read-only mode. However, data integration systems typically do not support important data management functions such as replication and updates, which our target collaborative applications require. Recent work on data integration systems has dealt with XML schema matching [DBH07].

    Dataspaces [FHM05] go one step further than data integration systems by relaxing the need for a global schema and providing data management functionality over all data sources, regardless of how they are integrated. One basic function is keyword search, which does not require any integration at all. However, for richer SQL-like querying over some data sources, an additional integration effort is needed, following an incremental, “pay-as-you-go” principle. The vision of a Data Ring in [AP07], which we adopt in this project, focuses on a high-level, easy-to-use dataspace for content sharing communities and emphasizes declarative querying with data exchanged in a high-level format (e.g. XML). The MetaQuerier project at the University of Illinois at Urbana-Champaign [CHZ05] adopts an extreme approach to web-scale data integration by automatically creating a unified interface to deep-web sources in specific semantic domains. The PayGo project at Google [MCD+07] represents a major effort to realize the vision of dataspaces and emphasizes pay-as-you-go as a means to achieve web-scale integration of structured data including deep-web sources and sites like Google Base. Besides improving semantic data integration in an incremental, “pay-as-you-go” fashion, PayGo proposes new components to go beyond the state of the art in data integration: management of approximate mappings, support of keyword queries, heterogeneous result ranking, and support of uncertainty for data mappings and queries.

     P2P data sharing systems. P2P techniques, which focus on scaling, dynamicity, autonomy and decentralized control, can be very useful to online communities. Initial research on P2P systems focused on improving the performance of query routing in unstructured systems, which rely on flooding. This work led to structured solutions based on distributed hash tables (DHT) or hybrid solutions with superpeers that index subsets of peers. Recent work on P2P data management has concentrated on supporting semantically rich data (e.g., XML documents, relational tables, etc.) using a high-level query language and distributed database capabilities (mostly schema management and query processing), e.g. ActiveXML [ABC+03], Appa [AMP+06]. Somewhere [RAC+06] enables more semantic integration of web data using ontologies. PeerSum [HRV+08] is a first attempt at building summaries over P2P data. Work on update support in P2P has started only recently [APV07, MPE+08]. Privacy is considered a critical issue in such systems. For instance, in social networks, users are very concerned about leaks of their private data. P2P systems have to provide access control mechanisms of the same quality as in centralized systems. More precisely, data owners should have the means to control access (in read or write mode) to their contents. This issue is challenging in P2P settings and should rely on sophisticated encryption techniques such as [ACF+06], privacy techniques such as [HSV08] and new trust models for P2P such as [NCR08].
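
    To illustrate how a structured (DHT-based) system avoids flooding, the sketch below maps each key to the peer responsible for it using consistent hashing on an identifier ring. It is a simplified, single-process illustration under our own assumptions (SHA-1 identifiers, full knowledge of the ring, no routing tables, no replication) and does not describe any of the systems cited above; the class, function and peer names are hypothetical.

```python
import bisect
import hashlib


def ring_position(name, bits=32):
    """Hash a peer or key name to a position on a ring of size 2**bits."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class Ring:
    """Identifier ring: each key is stored at the first peer at or after it."""

    def __init__(self, peers):
        # Sorted (position, peer) pairs; assumes peer positions do not collide.
        self._points = sorted((ring_position(peer), peer) for peer in peers)

    def lookup(self, key):
        position = ring_position(key)
        # First peer whose position is >= the key's position, wrapping around.
        index = bisect.bisect_left(self._points, (position, ""))
        return self._points[index % len(self._points)][1]


peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
ring = Ring(peers)
owner = ring.lookup("patient-record-42")  # one of the four peers

# When a peer leaves, only the keys it was responsible for move elsewhere.
smaller_ring = Ring([peer for peer in peers if peer != "peer-c"])
```

    The second ring shows why this placement suits dynamic peers: a departure reassigns only the keys held by the departing peer, and every other key keeps its owner.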

    Semantic web. The semantic web now provides a simple data expression language (RDF) with a powerful ontology language (OWL) and an associated query language (SPARQL). The amount of available information expressed in these languages is rapidly increasing. Because RDF has been designed from scratch for distributed use and integration, it is well suited to P2P data integration. Furthermore, using ontologies instead of schemas provides a flexible way to specialize data semantics for specific purposes: each peer, having different interests and different capabilities, can adapt ontologies to its purposes. Ontology reconciliation, and thus interoperability, can be obtained through ontology matching [ES07]. However, research on semantic P2P systems [HB04, SS06] has so far considered only ontologies which do not evolve. One challenge in DataRing is to serendipitously use ontology alignments in a P2P environment.


[ABC+03] S. Abiteboul, A. Bonifati, G. Cobena, I. Manolescu, T. Milo. Dynamic XML Documents with Distribution and Replication. ACM SIGMOD conf., 2003.

[ACF+06] S. Abiteboul, B. Cautis, A. Fiat, T. Milo. Digital Signatures for Modifiable Collections. Int. Conf. on Availability, Reliability and Security (ARES), 390-399, 2006.

[AP07] S. Abiteboul, N. Polyzotis. The Data Ring: Community Content Sharing. Conference on Innovative Data Systems Research (CIDR), 154-163, 2007.

[ABD+06] P. Agrawal, O. Benjelloun, A. Das Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty, and Lineage. Int. Conf. on Very Large Databases (VLDB), 1151-1154, 2006.

[AKS+02] R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. Int. Conf. on Very Large Databases (VLDB), 143-154, 2002.

[AMP+06] R. Akbarinia, V. Martins, E. Pacitti, P. Valduriez. Design and Implementation of APPA. Global Data Management (Eds. R. Baldoni, G. Cortese, F. Davide), IOS Press, 2006.

[APV07] R. Akbarinia, E. Pacitti, P. Valduriez. Data Currency in DHTs. ACM SIGMOD conf., 211-222, 2007.

[APV07a] R. Akbarinia, E. Pacitti, P. Valduriez. Best Position Algorithms for Top-k Queries. Int. Conf. on Very Large Databases (VLDB), 495-506, 2007.

[CHZ05] K.C-C. Chang, B. He, Z. Zhang. Toward Large Scale Integration: Building a MetaQuerier over Databases on the web. Conference on Innovative Data Systems Research (CIDR), 44-55, 2005.

[DS05] N.N. Dalvi, D. Suciu. Answering Queries from Statistics and Probabilistic Views. Int. Conf. on Very Large Databases (VLDB), 805-816, 2005.

[DBH07] F. Duchateau, Z. Bellahsene, E. Hunt. XBenchMatch: a Benchmark for XML Schema Matching Tools. Int. Conf. on Very Large Databases (VLDB), 1318-1321, 2007.

[ES07] J. Euzenat, P. Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg (DE), 2007.

[FLN03] R. Fagin, J. Lotem, M. Naor. Optimal Aggregation Algorithms for Middleware. Journal of Computer and System Sciences 66(4): 614-656, 2003.

[FHM05] M. Franklin, A. Halevy, D. Maier. From databases to dataspaces: a new abstraction for information management. ACM SIGMOD Record 34(4): 27-33, 2005.

[HRV+08] R. Hayek, G. Raschia, P. Valduriez, N. Mouaddib. Summary Management in P2P Systems. Int. Conf. on Extending Database Technology (EDBT), 2008.

[HB04] P. Haase, B. Schnizler et al. Bibster - A Semantics-Based Bibliographic Peer-to-Peer System. Journal of Web Semantics 2(1):122-136, 2004.

[NCR08] G. H. Nguyen, P. Chatalic, M-C. Rousset. A probabilistic trust model for semantic peer to peer systems. Int. Workshop on Data Management in Peer-to-Peer Systems (DaMaP), 59-65, 2008.

[HSV08] M. Jawad, P. Serrano-Alvarado, P. Valduriez. Design of PriServ, a privacy service for DHTs. Int. Workshop on Privacy and Anonymity in Information Society (PAIS), 21-25, 2008.

[MCD+07] J. Madhavan, S. Cohen, X.L. Dong, A.Y. Halevy, S.R. Jeffery, D. Ko, Cong Yu. Web-scale Data Integration: You can afford to Pay as You Go. Conference on Innovative Data Systems Research (CIDR), 342-350, 2007.

[MPE+08] V. Martins, E. Pacitti, M. El Dick, R. Jimenez-Peris. Scalable and Topology-Aware Reconciliation on P2P Networks. Distributed and Parallel Databases, to appear, 2008.

[ÖV99] T. Özsu, P. Valduriez. Principles of Distributed Database Systems. 2nd Edition, Prentice Hall, 1999 (3rd Edition forthcoming).

[RAC+06] M-C. Rousset, P. Adjiman, P. Chatalic, F. Goasdoue, L. Simon. SomeWhere: A Scalable Peer-to-Peer Infrastructure for Querying Distributed Ontologies. Int. Conf. on Ontologies, Databases and Applications of Semantics (ODBASE), 698-703, 2006.

[SA07] P. Senellart, S. Abiteboul. On the complexity of managing probabilistic XML data. Symposium on Principles of Database Systems (PODS), 283-292, 2007.

[SRM05] R. Saint-Paul, G. Raschia, N. Mouaddib. General Purpose Database Summarization. Int. Conf. on Very Large Databases (VLDB), 733-744, 2005.

[Sta+08] K. Stark, et al. GATiB-CSCW, Medical Research Supported by a Service-Oriented Collaborative System. Int. Conf. on Advanced Information Systems Engineering (CAiSE), 2008.

[SS06] S. Staab, H. Stuckenschmidt. Semantic web and peer to peer. Springer-Verlag, Heidelberg (DE), 2006.

[TRV98] A. Tomasic, L. Raschid, P. Valduriez. Scaling access to heterogeneous data sources with DISCO. IEEE Trans. on Knowledge and Data Engineering, 10(5), 808-823, 1998.