Provenance (in Computer Science)

Provenance is the ability to record the history of data and its place of origin. In general, it is the ability to determine the chronology of the ownership, custody or location of any object. The primary purpose of tracing the provenance of an object or entity is often to provide contextual and circumstantial evidence for its original production or discovery, by establishing, as far as practicable, its later history, especially the sequences of its formal ownership, custody, and places of storage. While originally limited to determining the heritage of works of art, the term now applies to wide range of fields, including archaeology, paleontology, archives, manuscripts, printed books, and science and computing. The latter is the context most relevant to my field of computer security.

In the context of data provenance, provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis, including the ability to detect advanced/persistent threats. Data provenance can provide a full historical record of data and its origins and the provenance of data which is generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated re-enactment of derivations to update data, and provide attribution of data sources. Provenance is also essential to the business domain where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.

The use of data provenance is proposed in distributed systems to trace records through a dataflow, replay the dataflow on a subset of its original inputs and debug data flows. In order to do so, one needs to keep track of the set of inputs to each operator, which were used to derive each of its outputs.

The w3c defines provenance as the ability to record a resource in order to describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

Why do we care?

Because provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility and assertions of provenance can themselves become important records with their own provenance. The widespread use of workflow flow tools for processing scientific data facilitate for capturing provenance information. The workflow process describes all the steps involved in producing a given data set and, hence captures it provenance information. Provenance can be used to record metrics such as data creator/data publisher, data creation date, data modifier & modification date, or data description.

There are two major strands of provenance for computer science: Data Provenance and Workflow Provenance. Data provenance is fine-grain and is used to determine the integrity of data flows. It is a description of the origin of a piece of data and process by which it arrives in a database. By contrast workflow provenance is coarser in grain. It refers to records of history of the derivation of the final output of workflow and is typically used for complex processing tasks. Fine-grain provenance can further categorized into: where, how and why-Provenance. A query execution simply copy data elements from some source to some target database and where-provenance identifies these source elements where the data in the target is copied from. Why-provenance provides justification for the data elements appearing in the output and how-provenance describes some parts of the input influenced certain parts of the output.

References

wikipedia on data lineage
scale free networks
basic vector clock description


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *