Copyright © 2016 InWeb/UFMG. This document is available under the W3C Document License. See the W3C Intellectual Rights Notice and Legal Disclaimers for additional information.
This document introduces a few tasks related to the data enrichment process, and suggests a set of best practices that should be followed when performing any of these tasks.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
By publishing this document, W3C acknowledges that the Submitting Members have made a formal Submission request to W3C for discussion. Publication of this document by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. This document is not the product of a chartered W3C group, but is published as potential input to the W3C Process. A W3C Team Comment has been published in conjunction with this Member Submission. Publication of acknowledged Member Submissions at the W3C site is one of the benefits of W3C Membership. Please consult the requirements associated with Member Submissions in Section 3.3 of the W3C Patent Policy. Please consult the complete list of acknowledged W3C Member Submissions.
Data enrichment refers to any process used to enhance, refine, or improve raw data. Its main objective is to make data a valuable asset for modern businesses and enterprises, facilitating the decision-making process.
The motivation behind the data enrichment process comes from the huge amount of data originated by different Web applications, which may be further refined and integrated to satisfy different user and business demands. Regardless of the source or application from which the data come, data enrichment techniques have to account for different data characteristics, including volume, velocity, and variety, as well as diversity.
These are all characteristics of "big data" applications, which have to be taken into account by any technique aiming to perform data enrichment. Below we define each of them: volume refers to the sheer amount of data to be processed and stored; velocity refers to the speed at which data are generated and must be handled; and variety (or diversity) refers to the range of data types, formats, and sources involved.
In order to deal with different data associated with Web applications, especially user-generated content (UGC), and to be able to effectively use these data, there are different types of computational tasks we may perform, and all these tasks need to account for the aforementioned data characteristics. These tasks are part of the data enrichment process, which needs to follow a set of recommended requirements.
This document briefly describes six common tasks involved in the data enrichment process, namely (i) fusion; (ii) entity recognition; (iii) disambiguation; (iv) segmentation; (v) imputation; and (vi) categorization (Section 2). Next, Section 3 presents a set of 7 recommended requirements that give rise to 7 best practices for data enrichment tasks in the context of data on the Web. Finally, Section 4 presents some conclusions.
This section presents a subset of six common and useful tasks involved in the data enrichment process, describing each of them. These tasks derive from a data-centric view, as illustrated by Figure 1.
It is important to point out that these tasks are usually performed using sophisticated methods coming from the areas of data mining, machine learning, statistics, and natural language processing, among others. Although we do not expect data practitioners to know in detail how they were implemented, it is paramount that they understand the requirements these tasks have and how they deal with the characteristics of applications in the era of ''big data''.
Note that the tasks described here are independent and orthogonal, and may be applied more than once during the data enrichment process. As illustrated in Figure 2, the data enrichment process consists of applying one or more tasks sequentially, where the input data D are transformed into output data D'. This process may be cyclic, where each task performed adds new information that may become input for a task performed later, or may simply generate metadata, which will be stored and used for future decision making. Next we describe these six representative tasks.
Data fusion is the process of integrating multiple data items representing the same real-world object into a consistent, accurate, and useful representation. Thus, this is an important task of the data enrichment process, as it allows information about an object, initially spread across multiple sources, to be concentrated in a single place.
In general, all tasks that demand any type of parameter estimation from multiple sources may benefit from the use of data/information fusion methods. The terms information fusion and data fusion are typically employed as synonyms, but in some scenarios the term data fusion is used for raw data (e.g., obtained directly from sensors) while the term information fusion is employed for preprocessed data. In this sense, the term information fusion implies a higher semantic level than data fusion. Other terms associated with data fusion that typically appear in the literature include decision fusion, data combination, data aggregation, multi-sensor data fusion, and sensor fusion.
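As a minimal, illustrative sketch of this idea: the function below fuses several source records describing the same object by taking, for each field, the most common non-missing value (a simple majority vote). The function name and the record layout are hypothetical; real fusion methods would additionally weight sources by their estimated reliability.

```python
from collections import Counter

def fuse_records(records):
    """Fuse multiple records about one real-world object into a single
    representation: for each field, keep the most common non-missing
    value across sources (simple majority vote)."""
    fused = {}
    fields = {f for r in records for f in r}
    for field in fields:
        values = [r[field] for r in records if r.get(field) is not None]
        if values:
            fused[field] = Counter(values).most_common(1)[0][0]
    return fused

# Three sources describe the same restaurant with conflicting phone data
sources = [
    {"name": "Casa Mia", "phone": "555-0101"},
    {"name": "Casa Mia", "phone": "555-0199"},
    {"name": "Casa Mia", "phone": "555-0101", "city": "Belo Horizonte"},
]
print(fuse_records(sources))
```

Note that the fused record keeps the field "city" even though only one source reports it: fusion concentrates complementary as well as conflicting information.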
The named entity recognition task labels sequences of words in a text. More specifically, the task locates and identifies names of people, companies, organizations, cities, and other predefined types of entities. The problem is usually modeled as a classification problem, with well-engineered features and sophisticated classifiers.
Features are usually extracted using natural language processing techniques, and consist of neighbor words, part-of-speech tags, neighbor entity labels, and word shapes and substrings. State-of-the-art classifiers are usually based on Conditional Markov Models, that is, the classifier makes a single decision at a time, conditioned on evidence from observations and previous decisions.
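The feature extraction step described above can be sketched as follows. This is a simplified illustration (the function names are hypothetical): it builds word-shape, substring, neighbor-word, and previous-label features for one token, the kind of evidence a conditional classifier would consume; part-of-speech tags are omitted for brevity.

```python
import re

def word_shape(token):
    """Map a token to a coarse shape: uppercase -> X, lowercase -> x,
    digit -> d (e.g., 'Maria' -> 'Xxxxx', 'W3C' -> 'XdX')."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return shape

def features(tokens, i, prev_label):
    """Features for token i, conditioned on observations (neighbor
    words, shape, substring) and on the previous decision."""
    return {
        "word": tokens[i].lower(),
        "shape": word_shape(tokens[i]),
        "prefix3": tokens[i][:3].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        "prev_label": prev_label,  # evidence from the previous decision
    }

tokens = ["Maria", "works", "at", "W3C"]
print(features(tokens, 0, "O")["shape"])  # "Xxxxx"
```

The "prev_label" feature is what makes the model conditional in the sense described above: each decision depends on the labels already assigned.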
The integration of textual data from different information extraction systems often requires disambiguating entity mentions in the text. Disambiguation is necessary due to non-uniform variations and ambiguity in entity names. The disambiguation task may be modeled as a classification problem. In this case, a named entity disambiguation classifier is trained using specific features, which are usually obtained from repositories such as DBpedia and Freebase.
The input of a disambiguation classifier is a set of ambiguous entities. For each ambiguous entity, it is given a set of candidate entities. Then, the features are used to train the classifier, which learns to disambiguate entities in the text.
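A minimal sketch of the candidate-selection step is shown below. Instead of a trained classifier, it scores each candidate entity by the word overlap between the mention's context and the candidate's description; the entity identifiers and descriptions are hypothetical stand-ins for what would come from a repository such as DBpedia.

```python
def disambiguate(mention_context, candidates):
    """Pick the candidate entity whose description best overlaps the
    mention's surrounding context (a bag-of-words score)."""
    context = set(mention_context.lower().split())

    def score(candidate):
        return len(context & set(candidate["description"].lower().split()))

    return max(candidates, key=score)

candidates = [
    {"id": "Paris_(France)", "description": "capital city of France in Europe"},
    {"id": "Paris_(Texas)", "description": "small city in Texas United States"},
]
best = disambiguate("the mayor of Paris announced the France budget", candidates)
print(best["id"])  # Paris_(France)
```

A real classifier would replace this overlap score with learned weights over many such features, but the structure of the task (ambiguous mention in, one candidate out) is the same.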
The task of segmentation refers to grouping data according to a set of desired and perhaps predefined characteristics. For example, nowadays we deal with large amounts of data coming from different social networks. During the data analysis process, knowing some user characteristics, such as gender, age, location, or social class, may help extract useful information in the context being analyzed. Such a strategy is valid not only for users, but also for other types of entities. Given any other type of data entity and a set of features we are interested in, the task of segmentation employs data mining techniques, such as clustering, to better understand the data.
However, in many cases the most difficult part of segmentation is not adding these new analysis dimensions to perform the segmentation, but identifying the feature of interest in the data [4,5,6]. For example, in order to determine the gender of a user when it is not given, we may use a suitable technique to infer this attribute, and then add it to the data for a more sophisticated data analysis.
Most of the cost of data segmentation is in the inference phase of the attributes of interest. This phase usually requires the use of natural language processing and text processing techniques, already explored in the literature.
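The two-step pattern described above (infer the missing attribute, then segment on it) can be sketched as follows. The name-to-gender lookup table here is a deliberately tiny, hypothetical stand-in for the trained inference techniques cited above [4,5,6].

```python
from collections import defaultdict

# Hypothetical lookup; a real system would infer this attribute with a
# classifier trained on profile text and behavior.
NAME_GENDER = {"ana": "female", "maria": "female", "joao": "male"}

def enrich_and_segment(users):
    """Infer a missing 'gender' attribute from the first name, then
    group (segment) users by the inferred value."""
    segments = defaultdict(list)
    for user in users:
        gender = user.get("gender") or NAME_GENDER.get(
            user["name"].split()[0].lower(), "unknown")
        segments[gender].append(user["name"])
    return dict(segments)

users = [{"name": "Ana Silva"}, {"name": "Joao Souza"}, {"name": "Kim Lee"}]
print(enrich_and_segment(users))
```

Note the "unknown" segment: a well-behaved inference step should expose, rather than hide, the cases it cannot resolve.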
Data imputation is the process of estimating values for missing or inconsistent data items (fields). This is extremely important when the data being collected is going to be used for data characterization or for generating models from data.
In statistics, imputation is the process of replacing missing data with estimated values. When replacing a single data point, it is known as "unit imputation"; when replacing a component of a data point, it is known as "item imputation". Because missing data may create problems for analyzing data, imputation is seen as a way to avoid the pitfalls associated with deleting missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with a likely value based on other available information. Once all missing values have been imputed, the data set can be analyzed using standard techniques for complete data.
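The simplest form of item imputation described above, mean imputation, can be sketched in a few lines; the function name and record layout are illustrative.

```python
def impute_mean(records, field):
    """Item imputation: replace missing values of one field with the
    mean of the observed values, so no record has to be discarded."""
    observed = [r[field] for r in records if r[field] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{field: r[field] if r[field] is not None else mean})
            for r in records]

records = [{"age": 30}, {"age": None}, {"age": 40}]
print(impute_mean(records, "age"))  # the missing age becomes 35.0
```

Mean imputation is only one strategy; regression-based or multiple imputation methods produce better estimates but follow the same preserve-all-cases pattern.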
One of the tasks to be considered at early enrichment stages is data categorization. Data categorization refers to the task of manually or automatically labeling data according to different types of categories, including (i) the topics present in a document, e.g., football or politics; (ii) the sentiment or opinion engraved in a sentence, e.g., positive or negative; or (iii) any other type of category, such as whether a text is related to a real-time event.
Next we present two different types of categorization: topification and sentiment analysis.
This section briefly describes a task known as topic discovery, extraction, or identification. In topic modeling, a data set (which can consist of documents, images, social network messages, among others) is grouped according to its content, based solely on unlabeled data [8,9,10].
Topic discovery methods are among the most widely explored for extracting information from large amounts of data. They were conceived to find semantically meaningful topics in a document corpus and are usually based on one of the following approaches: (i) clustering, which includes traditional data mining algorithms applied to textual data; (ii) probabilistic, such as Latent Dirichlet Allocation (LDA), where a generative model allows explaining sets of observations by the similarity inherent to some parts of the data; and (iii) non-probabilistic, which generates good-quality topics regardless of vocabulary overlap.
In the process of data enrichment, methods such as LDA may be used to extract semantic topics from text. These topics are represented by sets of words that, together, express the topic contained in a document. With the help of a specialist, semantics may be extracted from these sets of words for a qualitative evaluation. Alternatively, the information that two documents refer to the same topic may be used for data labeling.
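The "topic as a set of words" representation can be illustrated without a full LDA implementation. The sketch below simply takes the most frequent content words of a group of documents, a rough stand-in for the per-topic word distribution an LDA model would learn; the stopword list and function name are hypothetical.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def top_words(documents, k=3):
    """Represent the dominant topic of a document group by its k most
    frequent content words (stopwords removed)."""
    counts = Counter(
        w for doc in documents for w in doc.lower().split()
        if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

sports = ["the football match of the season",
          "football fans in the stadium",
          "a great football season"]
print(top_words(sports))  # ['football', 'season', ...]
```

An actual LDA model would assign each document a mixture of such word sets rather than a single one, but the output a specialist interprets has this shape.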
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source materials. It is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining.
In fact, research in sentiment analysis has spread outside of Computer Science to the Management and Social Sciences due to its importance to business and society as a whole. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions, blogs, micro-blogs, Twitter, and social networks. For the first time in human history, we now have a huge volume of opinionated data recorded in digital form for analysis.
We may then use labels generated by sentiment analysis techniques, such as positive, negative or neutral, to enhance the quality of raw Web data and, again, improve the decision making process.
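A minimal lexicon-based sketch of such labeling is shown below; the two word lists are tiny, illustrative stand-ins for the large curated lexicons or trained classifiers used in practice.

```python
# Tiny illustrative lexicons; real systems use large curated lexicons
# or trained classifiers.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def sentiment_label(text):
    """Label a sentence positive, negative, or neutral by counting
    lexicon hits and comparing the totals."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment_label("I love this great product"))  # positive
print(sentiment_label("terrible service"))           # negative
```

The resulting labels become new attributes of the data, exactly the kind of enrichment output discussed above.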
This section describes a set of desired properties of the data enrichment process. These properties are instantiated as recommended Best Practices (BP) when performing any task related to data enrichment. They are: reproducibility; evaluation; scalability; completeness; consistency; compliance with cost and resource constraints; and compatibility with different data types and application scenarios.
For all BPs defined, we present the following information: why the practice is needed, its intended outcome, a possible approach to implementation, and how to test it.
Each data enrichment task MUST be reproducible.
Why: A data enrichment task must always produce output that presents the same properties and characteristics, for a given input and set of parameters.
Intended Outcome: It should be possible to reproduce the task outcomes and to validate someone else's results.
Possible Approach to Implementation: Different types of techniques may be used to guarantee reproducibility in a data enrichment task, including:
How to Test: For each input data and technique pair, successive executions must generate output data that match with respect to properties and characteristics.
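One concrete way to satisfy this test is to make any stochastic component of the task deterministic by fixing its random seed, as in this hypothetical sketch:

```python
import random

def enrich(data, seed=42):
    """A toy enrichment task with a stochastic component (sampling).
    Fixing the seed makes successive executions produce identical
    output for the same input and parameters."""
    rng = random.Random(seed)
    return sorted(rng.sample(data, 3))

data = list(range(100))
# Same input and parameters -> same output, run after run
assert enrich(data) == enrich(data)
print(enrich(data))
```

Recording the seed (and any other parameters) alongside the output is what makes someone else's validation possible.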
Each data enrichment task MUST have an evaluation criterion.
Why: It must be possible to evaluate the success of a data enrichment task or the quality of its outcome.
Intended Outcome: The effective use of enriched data demands a clear and objective assessment of its goodness.
Possible Approach to Implementation:
The evaluation criterion may be:
How to Test: Each evaluation criterion may be tested by manual validation or automated techniques, such as statistical significance testing or hypothesis testing.
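As a minimal example of such an objective criterion, the sketch below computes accuracy against a manually validated gold standard; the label values are illustrative.

```python
def accuracy(predicted, expected):
    """An objective evaluation criterion: the fraction of enriched
    labels that match a manually validated gold standard."""
    assert len(predicted) == len(expected)
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

gold = ["positive", "negative", "neutral", "positive"]
output = ["positive", "negative", "positive", "positive"]
print(accuracy(output, gold))  # 0.75
```

A statistical significance test would then compare this figure against a baseline before declaring the enrichment successful.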
Each data enrichment task SHOULD be scalable.
Why: A data enrichment task must be able to handle a growing amount of data or workload in a capable manner, or to be enlarged to accommodate that growth.
Intended Outcome: The effective use of enriched data demands the ability to complete the task on growing amounts of data.
Possible Approach to Implementation: In order to achieve scalability we recommend:
How to Test: Provide different data sets for testing, varying their sizes and the number of attributes, and measuring the performance of the tasks.
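A simple harness for this test measures wall-clock time over inputs of growing size, so the growth rate of the cost can be compared against the growth rate of the data; the function names are illustrative.

```python
import time

def measure(task, sizes):
    """Run a task on inputs of growing size and record wall-clock time
    for each size."""
    timings = {}
    for n in sizes:
        data = list(range(n))
        start = time.perf_counter()
        task(data)
        timings[n] = time.perf_counter() - start
    return timings

# Example: timing Python's built-in sort on two input sizes
timings = measure(sorted, [10_000, 100_000])
print(sorted(timings))  # [10000, 100000]
```

A task whose timing grows much faster than the input size (e.g., quadratically) is a scalability risk and a candidate for the distributed or incremental approaches recommended above.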
Each data enrichment task MUST be complete with respect to the input domain.
Why: A data enrichment task must always be able to process any data from its input domain, producing results with the same properties and characteristics.
Intended Outcome: It should be possible to reuse the same data enrichment task over different data instances of the same application domain.
Possible Approach to Implementation: Different pairs of input data instances in the same input domain should be compared using relative evaluation criteria, and the results obtained should be the same w.r.t. their properties and characteristics. Pairs of data input instances should preferably represent different properties of the input data (e.g. missing data).
How to Test: For each pair of different input data, the output data should match with respect to properties and characteristics.
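This test can be automated by extracting, for each output, the properties of interest and checking that they match across varied inputs from the same domain, including ones with missing data. The property set chosen here (field names and types) is a hypothetical example of such criteria.

```python
def output_properties(output):
    """The properties required of every output: which fields are
    present and their types (illustrative criteria)."""
    return {field: type(value).__name__ for field, value in output.items()}

def check_completeness(task, inputs):
    """Run the task on varied instances of the input domain and verify
    all outputs share the same properties."""
    props = [output_properties(task(i)) for i in inputs]
    return all(p == props[0] for p in props)

# A toy task that must always emit name (str) and age (float),
# even when fields are missing from the input
task = lambda r: {"name": r.get("name", ""), "age": float(r.get("age", 0))}
inputs = [{"name": "Ana", "age": 30}, {"name": "Joao"}, {"age": 25}]
print(check_completeness(task, inputs))  # True
```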
The output of the data enrichment task MUST be consistent with the input and the task goals.
Why: A data enrichment task must always produce the expected output according to the task definition.
Intended Outcome: The output of the data enrichment task must be compliant with the expected result, defined according to the characteristics and properties of the input data.
Possible Approach to Implementation: Given an application domain, identify different input instances of a task that will require the same output, and verify the characteristics and properties of the output according to a relative evaluation criterion.
How to Test: Given any set of instances where the application domain requires the same output results, the actual result should match the expected output.
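A sketch of this test: given inputs that the application domain says are equivalent, verify the task maps all of them to the one expected result. The normalization task below is a hypothetical example.

```python
def check_consistency(task, equivalent_inputs, expected):
    """Given inputs that must yield the same result, verify the task's
    actual output matches the expected one for every instance."""
    return all(task(i) == expected for i in equivalent_inputs)

# Toy task: different spellings must map to one canonical city name
normalize = lambda s: s.strip().lower().replace("-", " ")
variants = ["Belo Horizonte", " belo horizonte ", "Belo-Horizonte"]
print(check_consistency(normalize, variants, "belo horizonte"))  # True
```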
The data enrichment task MUST meet the resource demands and deadlines defined by its specification.
Why: A data enrichment task must always produce output in compliance with the expected cost (e.g., execution time) and demand for resources (e.g., storage).
Intended Outcome: It should be possible to enrich data without failing deadlines and exhausting the allocated resources.
Possible Approach to Implementation: Different types of techniques can be used to guarantee the cost viability of a data enrichment task, including:
How to Test: Monitor the costs and resource utilization of each task execution, warning whenever it exceeds specification values.
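Such monitoring can be as simple as a wrapper that times each execution and warns when the specified deadline is exceeded; the wrapper below is an illustrative sketch (a real one would also track memory and storage).

```python
import time

def run_with_deadline(task, data, deadline_seconds):
    """Execute a task, measure its wall-clock cost, and warn when the
    cost exceeds the deadline defined in the task specification."""
    start = time.perf_counter()
    result = task(data)
    elapsed = time.perf_counter() - start
    if elapsed > deadline_seconds:
        print(f"WARNING: task took {elapsed:.3f}s, "
              f"exceeding the {deadline_seconds}s deadline")
    return result, elapsed

result, elapsed = run_with_deadline(sorted, list(range(1000)),
                                    deadline_seconds=5.0)
print(elapsed < 5.0)  # expected to be True on any modern machine
```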
Each data enrichment task SHOULD be compatible with different data types and application scenarios.
Why: Because it is necessary to deal with distinct data sets, which may contain different data types and properties.
Intended Outcome: It should be possible to enrich data effectively across different application scenarios and data types.
Possible Approach to Implementation: Different application domains may be used for evaluating the generality of a data enrichment task. The most important aspect to address is dealing with different attributes and data types, for example:
How to Test: Provide different application data sets for testing, with attributes of different data types.
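This test can be automated by running the same task over data sets whose attributes have different types and reporting which ones it handles; the harness and the toy task below are illustrative.

```python
def check_compatibility(task, datasets):
    """Run the same task over data sets with attributes of different
    types, and report which ones it handles without raising."""
    report = {}
    for name, data in datasets.items():
        try:
            task(data)
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"failed: {type(exc).__name__}"
    return report

# A toy task that counts distinct values of the first attribute
count_distinct = lambda rows: len({row[0] for row in rows})
datasets = {
    "numeric": [(1,), (2,), (1,)],
    "textual": [("a",), ("b",)],
    "mixed":   [(1,), ("a",), (None,)],
}
print(check_compatibility(count_distinct, datasets))
```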
This document has discussed the concept of Web data enrichment, which is paramount for the process of transforming data into useful information for decision making. We also described a subset of relevant and popular tasks performed during this process, including data fusion, imputation and categorization.
More importantly, we listed a set of 7 recommended best practices that can guarantee the data enrichment process is capable of dealing with the requirements of the era of ''big data''. By following them we can guarantee the data enrichment process will successfully meet its goals.
 Katal, A., Wazid, M., & Goudar, R. H. (2013). Big Data: Issues, Challenges, Tools and Good Practices. IEEE, 404-409.
 Federico Castanedo, A Review of Data Fusion Techniques, The Scientific World Journal, vol. 2013, Article ID 704504, 19 pages, 2013. doi:10.1155/2013/704504
 Mislove, Alan, et al. You are who you know: inferring user profiles in online social networks. Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010.
 Nguyen, Dong, et al. How Old Do You Think I Am? A Study of Language and Age in Twitter. ICWSM. 2013.
 Pennacchiotti, Marco, and Ana-Maria Popescu. Democrats, republicans and starbucks afficionados: user classification in twitter. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
 Enders, C.K. (2010). Applied missing data analysis. New York: Guilford Press.
Bai, Lu, et al. Group sparse topical coding: from code to topic. Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research 3 (2003): 993-1022.
 Yan, Xiaohui, et al. Learning topics in short texts by non-negative matrix factorization on term correlation matrix. Proceedings of the SIAM International Conference on Data Mining. 2013.
 Bing Liu; Minqing Hu and Junsheng Cheng (2005). Opinion Observer: Analyzing and Comparing Opinions on the Web. Proceedings of WWW 2005.
 Wright, Alex. Mining the Web for Feelings, Not Facts, New York Times, 2009-08-23. Retrieved on 2009-10-01.