W3C Member Submission

Recommended Best Practices for
Data Enrichment Scenarios

W3C Member Submission 29 February 2016

This version:
http://www.inweb.org.br/w3c/dataenrichment//
Latest version:
http://www.inweb.org.br/w3c/dataenrichment/
Authors:
Adriano C. M. Pereira (InWeb/UFMG)
Adriano A. Veloso (InWeb/UFMG)
Gisele L. Pappa (InWeb/UFMG)
Wagner Meira Jr. (InWeb/UFMG)

Abstract

This document introduces a few tasks related to the data enrichment process, and suggests a set of best practices that should be followed when performing any of these tasks.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

By publishing this document, W3C acknowledges that the Submitting Members have made a formal Submission request to W3C for discussion. Publication of this document by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. This document is not the product of a chartered W3C group, but is published as potential input to the W3C Process. A W3C Team Comment has been published in conjunction with this Member Submission. Publication of acknowledged Member Submissions at the W3C site is one of the benefits of W3C Membership. Please consult the requirements associated with Member Submissions of section 3.3 of the W3C Patent Policy. Please consult the complete list of acknowledged W3C Member Submissions.

1. Introduction

Data enrichment refers to any process used to enhance, refine, or improve raw data. Its main objective is to make data a valuable asset for modern businesses and enterprises, facilitating the decision-making process.

The motivation behind the data enrichment process comes from the huge amount of data originated by different Web applications, which may be further refined and integrated to satisfy different user and business demands. Regardless of the source or application from which the data come, data enrichment techniques have to account for different data characteristics, including volume, velocity, and variety [1].

These are all characteristics of "big data" applications, which have to be taken into account by any technique aiming to perform data enrichment. Below we define each of them:

  1. Data Volume: the ''Big'' in big data itself emphasizes the importance of data volume. Currently, the existing data are in the order of petabytes (10^15 bytes) and are expected to grow to zettabytes (10^21 bytes) in the near future. Data volume measures the amount of data available to or within an organization, which does not necessarily own the data but needs to be able to access it.
  2. Data Velocity: deals with the processing rate of the data coming from various sources. This characteristic is not limited to the acquisition rate of incoming data, but also covers the speed at which data flow and are aggregated within the data enrichment process.
  3. Data Variety: is a measure of the richness and diversity of the data representation - text, images, video, audio, etc. Data inherent to a given context do not necessarily belong to a single category, and dealing with all these types of data simultaneously is a challenge.

In order to deal with different data associated with Web applications, especially user-generated content (UGC), and to be able to effectively use these data, there are different types of computational tasks we may perform, and all these tasks need to account for the aforementioned data characteristics. These tasks are part of the data enrichment process, which needs to follow a set of recommended requirements.

This document briefly describes six common tasks involved in the data enrichment process, namely (i) fusion; (ii) entity recognition; (iii) disambiguation; (iv) segmentation; (v) imputation; and (vi) categorization (Section 2). Section 3 then presents a set of seven recommended requirements that give rise to seven best practices for data enrichment tasks in the context of data on the Web. Finally, Section 4 presents some conclusions.

2. Data Enrichment Tasks

This section presents a subset of six common and useful tasks involved in the data enrichment process and describes each of them. These tasks derive from a data-centric view, as illustrated in Figure 1.

Figure 1: Challenges.

It is important to point out that these tasks are usually performed using sophisticated methods coming from the areas of data mining, machine learning, statistics, and natural language processing, among others. Although we do not expect data practitioners to know in detail how these methods are implemented, it is paramount that they understand the requirements of these tasks and how the tasks deal with the characteristics of applications in the era of ''big data''.

Note that the tasks described here are independent and orthogonal, and may be applied more than once during the data enrichment process. As illustrated in Figure 2, the data enrichment process consists of applying one or more tasks sequentially, where the input data D are transformed into output data D'. This process may be cyclic, where each task performed adds new information that may become input for a task performed later, or it may simply generate metadata to be stored and used for future decision making. Next we describe these six representative tasks.

Figure 2: Data Enrichment Process.

2.1 Data Fusion

Data fusion is the process of integrating multiple data items representing the same real-world object into a consistent, accurate, and useful representation. It is thus an important task of the data enrichment process, as it allows information about an object, initially spread across multiple sources, to be concentrated in a single place.

In general, all tasks that demand any type of parameter estimation from multiple sources may benefit from the use of data/information fusion methods. The terms information fusion and data fusion are typically employed as synonyms, but in some scenarios the term data fusion is used for raw data (e.g., obtained directly from sensors) and the term information fusion is employed to refer to preprocessed data. In this sense, the term information fusion implies a higher semantic level than data fusion. Other terms associated with data fusion that typically appear in the literature include decision fusion, data combination, data aggregation, multi-sensor data fusion, and sensor fusion [2].
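
As an illustration only, the sketch below shows a minimal form of record-level data fusion using the pandas library: two sources describing the same entities are merged with a simple precedence rule. The column names, values, and precedence rule are hypothetical; real fusion methods involve conflict resolution and source-quality assessment far beyond this example.

    # Minimal sketch of record-level data fusion with pandas (hypothetical
    # column names and precedence rule; real fusion logic is domain-specific).
    import pandas as pd

    # Two sources describing the same real-world entities, keyed by "entity_id".
    source_a = pd.DataFrame({
        "entity_id": [1, 2, 3],
        "name": ["ACME Corp.", None, "Initech"],
        "city": ["Belo Horizonte", "Sao Paulo", None],
    }).set_index("entity_id")

    source_b = pd.DataFrame({
        "entity_id": [1, 2, 3],
        "name": ["ACME Corporation", "Globex", "Initech"],
        "city": [None, "Sao Paulo", "Rio de Janeiro"],
    }).set_index("entity_id")

    # Simple precedence rule: prefer source A, fill its gaps from source B.
    fused = source_a.combine_first(source_b)
    print(fused)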

2.2 Named Entity Recognition

The named entity recognition task labels sequences of words in a text. More specifically, the task locates and identifies names of people, companies, organizations, cities, and other predefined types of entities. The problem is usually modeled as a classification problem, with well-engineered features and sophisticated classifiers [3].

Features are usually extracted using natural language processing techniques, and consist of neighbor words, part-of-speech tags, neighbor entity labels, and word shapes and substrings. State-of-the-art classifiers are usually based on Conditional Markov Models, that is, the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions.
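
As an illustration, the following minimal sketch performs named entity recognition with spaCy's pretrained English pipeline. It assumes the "en_core_web_sm" model has been installed; the example sentence is ours, and this pipeline is just one of many possible NER implementations, not the specific classifiers discussed above.

    # Minimal NER sketch using spaCy's pretrained English pipeline (assumes the
    # model was installed with: python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Tim Berners-Lee founded the World Wide Web Consortium in Geneva.")

    # Each recognized span carries a predefined entity type (PERSON, ORG, GPE, ...).
    for ent in doc.ents:
        print(ent.text, ent.label_)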

2.3 Data Disambiguation

The integration of textual data from different information extraction systems often requires disambiguating entity mentions in the text. Disambiguation is necessary due to non-uniform variations and ambiguity in entity names. The disambiguation task may be modeled as a classification problem. In this case, a named entity disambiguation classifier is trained using specific features, which are usually obtained from repositories such as DBpedia and Freebase.

The input of a disambiguation classifier is a set of ambiguous entities. For each ambiguous entity, a set of candidate entities is provided. The features are then used to train the classifier, which learns to disambiguate entities in the text.
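
The sketch below illustrates the idea with a deliberately simplified, similarity-based baseline rather than the trained classifier described above: an ambiguous mention is assigned to the candidate whose description is most similar to the mention's context. The candidate names and descriptions are illustrative placeholders, not real repository entries.

    # Simplified disambiguation baseline: rank candidate entities for an
    # ambiguous mention by TF-IDF cosine similarity between the mention's
    # context and each candidate's (placeholder) description.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    mention_context = "The striker scored twice for Brazil in the World Cup final."
    candidates = {
        "Ronaldo (Brazilian footballer)": "Brazilian striker and World Cup winner",
        "Ronaldo (Portuguese footballer)": "Portuguese forward and Champions League winner",
    }

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([mention_context] + list(candidates.values()))
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

    # Pick the candidate whose description best matches the mention's context.
    best = max(zip(candidates, scores), key=lambda pair: pair[1])
    print("Best candidate:", best[0], "score:", round(float(best[1]), 3))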

2.4 Segmentation

The task of segmentation refers to grouping data according to a set of desired and perhaps predefined characteristics. For example, nowadays we deal with large amounts of data coming from different social networks. During the data analysis process, knowing some user characteristics, such as gender, age, location, or social class, may help extract useful information in the context being analyzed. Such a strategy is valid not only for users, but also for other types of entities. Given any other type of data entity and a set of features we are interested in, the task of segmentation employs data mining techniques, such as clustering, to better understand the data.

However, in many cases the most difficult part of segmentation is not adding these new analysis dimensions to perform the segmentation, but identifying the features of interest in the data [4,5,6]. For example, in order to determine the gender of a user, if it is not given, we may use a suitable technique to infer this attribute and then add it to the data for a more sophisticated data analysis.

Most of the cost of data segmentation is in the inference phase of the attributes of interest. This phase usually requires the use of natural language processing and text processing techniques, already explored in the literature.
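
As a minimal illustration of the clustering step mentioned above, the sketch below segments a handful of users described by hypothetical features (age, posting rate, followers) with k-means. The attribute values are invented, and the attribute-inference phase discussed above is not shown.

    # Minimal segmentation sketch: cluster users by hypothetical features.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Each row is a user: [age, posts_per_day, followers] (illustrative values).
    users = np.array([
        [18, 12.0, 150],
        [22, 9.5, 300],
        [45, 1.2, 80],
        [51, 0.8, 60],
        [33, 4.0, 500],
    ])

    # Standardize features so no single attribute dominates the distance metric.
    features = StandardScaler().fit_transform(users)

    # Group users into a small number of segments and print the segment labels.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
    print(kmeans.labels_)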

2.5 Data Imputation

Data imputation is the process of estimating values for missing or inconsistent data items (fields). This is extremely important when the data being collected is going to be used for data characterization or for generating models from data.

In statistics, imputation is the process of replacing missing data with estimated values. When replacing a single data point, it is known as "unit imputation"; when replacing a component of a data point, it is known as "item imputation". Because missing data may create problems for analyzing data, imputation is seen as a way to avoid pitfalls associated with deletion of missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with a likely value based on other available information. Once all missing values have been imputed, the data set can then be analyzed using standard techniques for complete data [7].
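
The sketch below illustrates item imputation in its simplest form, replacing missing numeric values with the column mean using scikit-learn. The data values are invented, and other strategies (median, most frequent, or model-based imputation) may be more appropriate in practice.

    # Minimal item-imputation sketch: fill missing values with the column mean.
    import numpy as np
    from sklearn.impute import SimpleImputer

    # Each row is a record: [age, monthly_income] with some missing fields.
    data = np.array([
        [25.0, 1500.0],
        [np.nan, 1800.0],
        [40.0, np.nan],
        [35.0, 2100.0],
    ])

    imputer = SimpleImputer(strategy="mean")
    completed = imputer.fit_transform(data)
    print(completed)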

2.6 Data Categorization

One of the tasks to be considered at early enrichment stages is data categorization. Data categorization refers to the task of manually or automatically labeling data according to different types of categories, including (i) the topics present in a document, e.g., football or politics; (ii) the sentiment or opinion conveyed in a sentence, e.g., positive or negative; or (iii) any other type of category, including whether a piece of text is related to a real-time event.

Next we present two different types of categorization: topification and sentiment analysis.

2.6.1 Topification

This section briefly describes a task known as topic discovery, extraction, or identification. In topic modeling, the items in a data set (which can be documents, images, social network messages, among others) are grouped according to their content, based solely on unlabeled data [8,9,10].

Topic discovery methods are among the most widely explored for extracting information from large amounts of data. They were conceived to find semantically meaningful topics in a document corpus and are usually based on one of the following approaches: (i) clustering, which includes traditional data mining algorithms applied to textual data; (ii) probabilistic, such as Latent Dirichlet Allocation (LDA), where a generative model explains sets of observations by the similarity inherent to some parts of the data [9]; and (iii) non-probabilistic, which generates good-quality topics regardless of vocabulary overlap.

In the process of data enrichment, methods such as LDA may be used to extract semantic topics from text. These topics are represented by sets of words that, together, express the topic contained in a document. With the help of a specialist, semantics may be extracted from these sets of words for a qualitative evaluation. Alternatively, the information that two documents refer to the same topic may be used for data labeling.
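
As an illustration, the sketch below fits an LDA model to a tiny invented corpus with scikit-learn and prints the top words that represent each topic. Real applications require much larger corpora and careful preprocessing.

    # Minimal topic-extraction sketch with Latent Dirichlet Allocation (LDA)
    # over a toy, invented corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = [
        "the team won the football match and the championship",
        "the striker scored a goal in the football game",
        "the senate voted on the new economic policy",
        "the president announced a policy reform in congress",
    ]

    # Bag-of-words representation of the documents.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(corpus)

    # Fit a two-topic model and print the top words of each topic.
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [terms[j] for j in topic.argsort()[-4:][::-1]]
        print(f"Topic {i}:", ", ".join(top))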

2.6.2 Sentiment Analysis

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source materials. It is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining [11].

In fact, research in sentiment analysis has spread outside of Computer Science to the Management and Social Sciences due to its importance to business and society as a whole. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions, blogs, micro-blogs, Twitter, and social networks. For the first time in human history, we now have a huge volume of opinionated data recorded in digital form for analysis [12].

We may then use labels generated by sentiment analysis techniques, such as positive, negative or neutral, to enhance the quality of raw Web data and, again, improve the decision making process.
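
As a minimal illustration, the sketch below assigns positive, negative, or neutral labels to example sentences with NLTK's VADER analyzer. The sentences and the thresholds on the compound score are illustrative choices, not a recommendation of any particular tool or threshold.

    # Minimal sentence-level sentiment labeling sketch using NLTK's VADER
    # analyzer (requires nltk.download("vader_lexicon") once beforehand).
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    sentences = [
        "I loved this product, the battery lasts forever!",
        "Terrible service, I will never buy here again.",
    ]

    # Map the compound score to a discrete label that can enrich raw data.
    for sentence in sentences:
        score = analyzer.polarity_scores(sentence)["compound"]
        label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
        print(label, round(score, 3), sentence)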

3. Properties of Data Enrichment Tasks

This section describes a set of desired properties of the data enrichment process. These properties are instantiated as recommended Best Practices (BP) when performing any task related to data enrichment. They are:

  1. Reproducibility: Each data enrichment task MUST be reproducible.
  2. Evaluation Criterion: Each data enrichment task MUST have an evaluation criterion.
  3. Scalability: Each data enrichment task SHOULD be scalable.
  4. Completeness: Each data enrichment task MUST be complete with respect to the input domain.
  5. Consistency: The output of the data enrichment task MUST be consistent with the input and the task goals.
  6. Cost viability: The data enrichment task MUST meet the specification deadlines.
  7. Generality: The data enrichment task SHOULD be applicable to different data types and application scenarios.

For all BPs defined, we present the following information: