An approach to evaluate RDF data completeness

Fayçal Hamdi, Samira Si-said Cherfi

Laboratoire Cédric, Conservatoire national des arts et métiers Paris, France

Corresponding Author Email: {faycal.hamdi,samira.cherfi}@cnam.fr
Page: 31-52
DOI: https://doi.org/10.3166/ISI.21.3.31-52
Abstract: 

With the development of data-based applications, data quality has become a pressing issue in the context of the Web of Data. Organizations as well as researchers need suitable methods and techniques to help ensure web data quality along the whole process, from data transformation and publication to data querying and exploitation. Among quality dimensions, completeness is recognized as difficult to evaluate, as it often relies on gold standards and/or a reference schema that are not always available, nor always realistic from a practical point of view. In this paper, we propose an approach for the assessment of RDF data completeness. The proposed solution consists of, first, inferring a schema using a frequent itemset mining approach, and second, measuring completeness with respect to the inferred schema. The paper presents both the theoretical background and experimental results obtained on real-world RDF datasets.

Keywords: 

linked data, RDF data quality, completeness, quality evaluation
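
The two steps named in the abstract (schema inference by frequent itemset mining, then completeness measurement against the inferred schema) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy instances and property names are hypothetical, the miner is a naive exhaustive enumeration rather than an optimized algorithm, the inferred schema is taken to be the set of maximal frequent property sets, and the completeness of an instance is assumed to be the share of those property sets it fully contains.

```python
from itertools import combinations

# Hypothetical toy data: each RDF instance is reduced to the set of
# properties it uses (names are illustrative only).
instances = {
    "film_a": {"title", "director", "releaseDate"},
    "film_b": {"title", "director", "releaseDate"},
    "film_c": {"title", "runtime"},
    "film_d": {"title", "runtime", "director", "releaseDate"},
}


def maximal_frequent_itemsets(transactions, min_support):
    """Naively enumerate frequent property sets, then keep the maximal ones.

    Exhaustive enumeration is exponential in the number of properties;
    it is only meant to make the idea concrete on tiny inputs.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(cset <= t for t in transactions) / n
            if support >= min_support:
                frequent.append(cset)
    # A frequent set is maximal if no frequent proper superset exists.
    return [f for f in frequent if not any(f < g for g in frequent)]


def completeness(transactions, schema):
    """Assumed measure: average share of schema patterns each instance contains."""
    if not schema:
        raise ValueError("the inferred schema is empty")
    per_instance = [
        sum(pattern <= t for pattern in schema) / len(schema)
        for t in transactions
    ]
    return sum(per_instance) / len(per_instance)


transactions = list(instances.values())
schema = maximal_frequent_itemsets(transactions, min_support=0.5)
score = completeness(transactions, schema)
print([sorted(p) for p in schema], score)
```

Lowering `min_support` admits rarer properties into the inferred schema and therefore tends to lower the score, which is why the threshold has to be chosen with the intended use of the dataset in mind.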

1. Introduction
2. Illustration by example
3. Problem statement
4. Extracting the schema of an RDF data source
5. Empirical evaluation
6. State of the art
7. Conclusion