Web Data Management

In the traditional Web setting, it is inherently difficult to access data that are embedded in pages and not readily available for retrieval. This difficulty is mainly caused by the unstructured nature of Web objects, for which access and retrieval are usually possible only (with limitations) through search engines such as Google, Yahoo!, and MSN.

However, a large portion of the Web is made up of pages that can be seen as data containers, since they implicitly include data that can be independently identified, extracted, and manipulated. Examples can be found on the Web sites of bookstores, online stores, travel agencies, and classified-ad services, among others. Such “data-rich pages” [52] require more than efficient selection and retrieval methods; they also raise the question of how to manipulate their data adequately. This leads to an interesting paradox: although widely available, Web data cannot be queried and manipulated as easily as data in a traditional database. Even after the creation of standards such as XML and RDF, which give the Web some degree of structure, most content is still available as HTML pages, either static or dynamically created.
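To make the notion of a data-rich page concrete, the following is a minimal sketch in Python of how records implicitly present in an HTML page might be identified, extracted, and then queried. The bookstore markup, the class names `book`, `title`, and `price`, and the names `BookExtractor` and `extract_books` are all assumptions made for illustration; real pages are far less regular, which is precisely what makes the extraction problem hard.

```python
from html.parser import HTMLParser

# Hypothetical bookstore listing; the markup and class names are assumptions
# made for illustration, not taken from any real site.
PAGE = """
<html><body>
  <div class="book"><span class="title">Modern Information Retrieval</span>
    <span class="price">59.90</span></div>
  <div class="book"><span class="title">Database Systems</span>
    <span class="price">72.50</span></div>
</body></html>
"""


class BookExtractor(HTMLParser):
    """Collects one raw record per <div class="book"> element."""

    FIELDS = {"title", "price"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "book":
            self.records.append({})  # a new record starts here
        elif tag == "span" and css_class in self.FIELDS and self.records:
            self._field = css_class

    def handle_data(self, data):
        # Text is kept only while inside one of the recognized field spans.
        if self._field is not None:
            record = self.records[-1]
            record[self._field] = record.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_books(html):
    """Return the complete records found in the page as typed dictionaries."""
    parser = BookExtractor()
    parser.feed(html)
    parser.close()
    return [
        {"title": r["title"].strip(), "price": float(r["price"])}
        for r in parser.records
        if "title" in r and "price" in r
    ]


books = extract_books(PAGE)
# Once extracted, the records can be filtered like rows in a table.
cheap = [b["title"] for b in books if b["price"] < 60]
```

The extractor here is hand-written for one fixed layout. It is only meant to show the step from page to records, not any particular extraction method from the cited work.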

The term Web Data Management (WDM) has recently been used [19, 58, 78] to refer to the study of problems related to the collection, extraction, modeling, querying, storage, transformation, and integration of data available on the Web. In the last few years, these problems have attracted growing interest from the scientific community, not only because of the technical and scientific challenges they pose, but particularly because of industry demand for effective solutions.

WDM techniques are therefore crucial for tackling various problems connected to Challenge 2. For instance, they are fundamental to the creation of agents capable of navigating the so-called Hidden Web, filling out forms and collecting the resulting pages, which contain the data of interest [80], as required in goal 2.1. WDM techniques are also important for addressing several problems related to goals 2.5, 2.6, 2.7, 2.8, and 2.10, including, for instance, the extraction of bibliographic references [116] for the study of co-authorship networks (goal 2.6) and the recognition of references to places [108] (goals 2.1, 2.3, and 2.8). Other examples include the recognition of objects for the identification of document versions [30, 50] (goal 2.7) and the development of keyword-based search interfaces for fast integration of Web data sources [26, 85] (goal 2.7).

The Web Data Management (WDM) research line will be carried out, primarily, by the researchers Alberto Laender (UFMG), Clodoveu Davis (UFMG), Altigran Silva (UFAM), Mirella Moro (UFMG), Marcos Gonçalves (UFMG), Evandrino Barros (CEFET-MG), João Cavalcanti (UFAM), José Palazzo (UFRGS), Carlos Heuser (UFRGS), and Renata Galante (UFRGS).

Copyright © 2010 InWeb - Instituto Nacional de Ciência e Tecnologia para a Web - All rights reserved.