Monthly Archives: October 2014

Provenance and Cybersecurity

Image: New Cybersecurity Partnership Merrill College of Journalism ©2012. CC BY-NC 2.0

When you find some data on the Web, do you have any information about how it got there? It is quite possible that it was copied from somewhere else on the Web, which, in turn may have also been copied; and in this process it may well have been transformed and edited…if you are a scientist, or any kind of scholar, you would like to have confidence in the accuracy and timeliness of the data that you are working with. (Buneman, Khanna, and Tan, 2000).

The provenance of a digital object is a record of that object’s origins. It can help to verify the quality of data: where it came from, who altered it, who owns it, whether it can be trusted, and how many versions have existed over time. With provenance in place, systems become more accountable. However, not all information in a provenance record should be shared with everyone. Securing provenance records is essential, because the relationships they contain can reveal information that is not necessarily public. With the right security measures in place, provenance can help to establish the trustworthiness of data.
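To make these questions concrete, here is a minimal Python sketch. It assumes a provenance record is stored as a simple mapping from each version of an object to the version it was derived from and the party who produced it. The record layout, the identifiers and the `trace_lineage` helper are hypothetical illustrations, not part of any provenance standard.

```python
# Hypothetical provenance record: each version notes its source and who produced it.
RECORD = {
    "report_v1": {"derived_from": None, "altered_by": "alice"},
    "report_v2": {"derived_from": "report_v1", "altered_by": "bob"},
    "report_v3": {"derived_from": "report_v2", "altered_by": "alice"},
}


def trace_lineage(record, version):
    """Walk back from a version to its origin, newest first."""
    chain = []
    seen = set()
    while version is not None:
        if version in seen:
            raise ValueError(f"cycle in provenance record at {version}")
        if version not in record:
            raise KeyError(f"no provenance recorded for {version}")
        seen.add(version)
        chain.append(version)
        version = record[version]["derived_from"]
    return chain


lineage = trace_lineage(RECORD, "report_v3")
origin = lineage[-1]  # where the data came from
editors = sorted({RECORD[v]["altered_by"] for v in lineage})  # who altered it
print(f"origin: {origin}, versions: {len(lineage)}, altered by: {editors}")
```

A version with no recorded provenance raises an error rather than being silently trusted, which reflects the point that trust depends on the record being present.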

Provenance is often expressed in PROV, a family of W3C specifications for describing the entities, activities and agents involved in creating or altering a digital object. PROV records are metadata, and their most important elements are entities, activities and agents. Entities are the things whose provenance is being described; an example of an entity is a Web page. Activities describe how entities come into being and how they change over time. So, for example, the editing that produces a new version of a document is described as an activity, and the new version is an entity generated by it. The person or organisation that altered the document is the agent. PROV can be serialised in several formats: RDF triples [PROV-O], PROV-N expressions [PROV-N] and XML fragments [PROV-XML].
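As an illustration of how entities, activities and agents relate, the following Python sketch assembles a small PROV-N-style document for a Web page edited by one person. It is a rough sketch only: the `relation` and `build_document` helpers, the `ex:` prefix and all identifiers are assumptions made for this example, and the output has not been checked with a PROV validator. The `-` marker stands for an optional argument (such as a time or plan) that is left unspecified.

```python
def relation(name, *args):
    """Format one PROV-N-style expression, e.g. used(ex:edit, ex:page_v1, -)."""
    return f"{name}({', '.join(args)})"


def build_document(prefix, namespace, expressions):
    """Wrap expressions in a document block with a single prefix declaration."""
    body = [f"  {e}" for e in expressions]
    return "\n".join(
        ["document", f"  prefix {prefix} <{namespace}>", *body, "endDocument"]
    )


expressions = [
    # The things being described: two versions of a Web page.
    relation("entity", "ex:page_v1"),
    relation("entity", "ex:page_v2"),
    # The edit that turned one version into the next, and who did it.
    relation("activity", "ex:edit"),
    relation("agent", "ex:alice"),
    # Relations tying entities, the activity and the agent together.
    relation("used", "ex:edit", "ex:page_v1", "-"),
    relation("wasGeneratedBy", "ex:page_v2", "ex:edit", "-"),
    relation("wasAssociatedWith", "ex:edit", "ex:alice", "-"),
    relation("wasDerivedFrom", "ex:page_v2", "ex:page_v1"),
    relation("wasAttributedTo", "ex:page_v2", "ex:alice"),
]

document = build_document("ex", "http://example.org/", expressions)
print(document)
```

In practice a dedicated PROV library would be a better choice than string formatting, since it can also validate the record and convert between serialisations.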

Key work in this area stems from the first Provenance Workshop (held in 2002), the biennial International Provenance and Annotation Workshops (IPAW, held since 2006), the USENIX Theory and Practice of Provenance workshops (TaPP, held since 2009), the four Provenance Challenges (held between 2006 and 2010), and the newly inaugurated Provenance Week, which combined IPAW and TaPP and was held in Cologne in June 2014.

Like Web Science, Provenance research is multidisciplinary. The key research communities in this area are: Security (access control, authenticity and scalability), Database (theory, data collection, and dependency analysis), Workflow (system approaches and applications of Semantic Web technologies), Open Provenance Model (database technology), Provenance Challenge (approaches to interoperability), e-Science (representing, querying, and automating data lineage in science databases) and electronic notebook research (semantically derived trust judgement reasoners).

Because Provenance makes systems transparent and auditable (so that information can be reliably tracked on the Web) and helps to prevent the misuse of data, there is also potential for new compliance-checking services for financial institutions, businesses and government, as well as education and research (Moreau, 2010). In the public realm, for example, trust in public records is vital to legal processes and the business of government. Recently The Gazette (the UK’s official public record) has applied the PROV model to its databases in order to establish trust in the authenticity of its records.

Another important area is the provision of tools to facilitate the adoption of Provenance by data managers. To this end the University of Southampton has developed a collection of tools, applications and standards within its Southampton Provenance Suite.

While Provenance standards are maturing (currently at version 0.6.1), a number of significant challenges remain in the field. The Provenance Reconstruction Challenge highlights the need to spread the adoption of data provenance, and calls on the research community to devise methods that use data objects’ computational environment to facilitate improved authentication.

Further reading:

  • Buneman, P., Khanna, S., & Tan, W. C. (2000). Data provenance: Some basic issues. In FST TCS 2000: Foundations of software technology and theoretical computer science (pp. 87-93). Springer Berlin Heidelberg.
  • Buneman, P., Khanna, S. and Tan, W-C. (2001). Why and Where: A Characterization of Data Provenance. In Proceedings of 8th International Conference on Database Theory (ICDT’01), volume 1973 of Lecture Notes in Computer Science, pages 316–330, London, UK, 2001. Springer. (doi: http://dx.doi.org/10.1007/3-540-44503-X_20)
  • Ceolin, D., Moreau, L., O’Hara, K., Fokkink, W., Van Hage, W. R., Maccatrozzo, V., Sackley, A., Schreiber, G. and Shadbolt, N. (2014). Two procedures for analyzing the reliability of open government data. In 15th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU’2014), Montpellier, FR, 15 Jul 2014. 10pp.
  • Huynh, T. D. and Moreau, L. (2014). ProvStore: a public provenance repository. At 5th International Provenance and Annotation Workshop (IPAW’14), Cologne, DE, 09–13 Jun 2014. 3pp.
  • Moreau, L. (2010). The foundations for provenance on the web. Foundations and Trends in Web Science, 2(2–3), 99-241. http://eprints.soton.ac.uk/271691/
  • Moreau, L. (2014). Provenance in the Wild. [Blog] Available at: http://lucmoreau.wordpress.com/category/provenance-in-the-wild/
  • Moreau, L. (2014). Aggregation by Provenance Types: A Technique for Summarising Provenance Graphs. Unpublished (submitted).
  • Moreau, L. and Ali, M. (2014). A provenance-based policy control framework for cloud services. In IPAW’2014: 5th International Provenance and Annotation Workshop, Cologne, DE, 09–13 Jun 2014. LNCS. 12pp.
  • Moreau, L., Huynh, T. D., & Michaelides, D. (2014). An Online Validator for Provenance: Algorithmic Design, Testing, and API. In Fundamental Approaches to Software Engineering (pp. 291-305). Springer Berlin Heidelberg.
  • Sezavar Keshavarz, A., Huynh, T. D. and Moreau, L. (2014). Provenance for online decision making. International Provenance and Annotation Workshop, Cologne, DE, 09–13 Jun 2014. Springer. 12pp.
  • Weitzner, D. J., Abelson, H., Berners-Lee, T., Feigenbaum, J., Hendler, J. and Sussman, G. J. (2008). Information accountability. Commun. ACM, 51(6):81–87, June 2008. (doi: http://doi.acm.org/10.1145/1349026.1349043)

Professor Luc Moreau and Dr Trung Dong Huynh from WAIS at the University of Southampton were members of the Provenance working group at W3C when the primer for PROV was produced.