Provenance (in Computer Science)

Provenance is the ability to record the history of data and its place of origin. In general, it is the ability to determine the chronology of the ownership, custody or location of any object. The primary purpose of tracing the provenance of an object or entity is often to provide contextual and circumstantial evidence for its original production or discovery by establishing, as far as practicable, its later history, especially the sequence of its formal ownership, custody, and places of storage. While originally limited to determining the heritage of works of art, the term now applies to a wide range of fields, including archaeology, paleontology, archives, manuscripts, printed books, science, and computing. Computing is the context most relevant to my field of computer security.

In the context of data, provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis, including the ability to detect advanced persistent threats. The provenance of data generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, trace errors back to their sources, automatically re-enact derivations to update data, and attribute data to its sources. Provenance is also essential in the business domain, where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.

Data provenance has been proposed in distributed systems as a way to trace records through a dataflow, replay the dataflow on a subset of its original inputs, and debug dataflows. To do so, one needs to keep track, for each operator, of the set of inputs that were used to derive each of its outputs.
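
Here is a minimal sketch of that bookkeeping in Python. It is not taken from any particular system: the names Record, source, map_op, group_sum, pipeline, and replay are all made up for illustration, and the lineage is simply the set of positions of the original inputs that each output was derived from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    value: int
    lineage: frozenset  # positions of the original inputs behind this value


def source(values):
    """Tag each raw input with its own position as its lineage."""
    return [Record(v, frozenset({i})) for i, v in enumerate(values)]


def map_op(fn, records):
    """One-to-one operator: each output inherits its input's lineage."""
    return [Record(fn(r.value), r.lineage) for r in records]


def group_sum(key_fn, records):
    """Many-to-one operator: an output's lineage is the union of its inputs'."""
    totals, lineages = {}, {}
    for r in records:
        k = key_fn(r.value)
        totals[k] = totals.get(k, 0) + r.value
        lineages[k] = lineages.get(k, frozenset()) | r.lineage
    return {k: Record(totals[k], lineages[k]) for k in totals}


def pipeline(values):
    """Hypothetical dataflow: double every input, then sum by value mod 3."""
    doubled = map_op(lambda v: v * 2, source(values))
    return group_sum(lambda v: v % 3, doubled)


def replay(values, record):
    """Re-run the pipeline on only the inputs that contributed to `record`."""
    return pipeline([values[i] for i in sorted(record.lineage)])
```

With inputs [1, 2, 3, 4, 5, 6], the output for key 2 is 10 and its lineage is positions 0 and 3, so replay re-runs the pipeline on just [1, 4] and reproduces the same value. Note that the replayed record's lineage is renumbered relative to the subset, which a real system would have to handle.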

The W3C defines the provenance of a resource as a record that describes the entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

Why do we care?

Because provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility, and because assertions of provenance can themselves become important records with their own provenance. The widespread use of workflow tools for processing scientific data makes it easier to capture provenance information. The workflow process describes all the steps involved in producing a given data set and hence captures its provenance. Provenance can be used to record metadata such as the data creator or publisher, the creation date, the modifier and modification date, and a description of the data.

There are two major strands of provenance in computer science: data provenance and workflow provenance. Data provenance is fine-grained and is used to determine the integrity of data flows. It is a description of the origin of a piece of data and the process by which it arrived in a database. By contrast, workflow provenance is coarser-grained. It refers to the record of the history of the derivation of the final output of a workflow and is typically used for complex processing tasks. Fine-grained provenance can be further categorized into where-, how-, and why-provenance. A query execution simply copies data elements from some source to some target database, and where-provenance identifies the source elements from which the data in the target was copied. Why-provenance provides the justification for the data elements appearing in the output, and how-provenance describes how parts of the input influenced certain parts of the output.
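
To make where- and why-provenance a little more concrete, here is a toy sketch in Python. The two tables, their contents, and the function names_at_site are invented for illustration; the query is a plain join, and each output row carries the source cell its name was copied from (where) and the set of source rows that justify its presence (why). How-provenance is not modeled here.

```python
# Two tiny source tables, keyed by a row id (all values are made up).
employees = {
    "e1": {"name": "Ada", "dept": "d1"},
    "e2": {"name": "Bob", "dept": "d2"},
    "e3": {"name": "Cy", "dept": "d1"},
}
departments = {
    "d1": {"site": "DC"},
    "d2": {"site": "Vegas"},
}


def names_at_site(site):
    """Join employees to departments and annotate each output row."""
    rows = []
    for emp_id, emp in employees.items():
        dept_id = emp["dept"]
        if departments[dept_id]["site"] != site:
            continue
        rows.append({
            "name": emp["name"],
            # where-provenance: the exact source cell the value was copied from
            "where": ("employees", emp_id, "name"),
            # why-provenance: the source rows that together justify this output
            "why": {("employees", emp_id), ("departments", dept_id)},
        })
    return rows
```

Calling names_at_site("DC") returns Ada and Cy. Ada's name was copied from the name cell of employee row e1, and she appears in the output because of employee row e1 together with department row d1.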


Busy IMA preparing for a Lt Col Board

Hello fellow IMA. My apologies to you. Life is not easy. In the civilian world, you work hard, play office politics and with a little luck you might get promoted. Not so in the reserves. Here your promotion depends heavily on your ability to decode a bunch of Air Force personnel jargon and to make a lot of non-cooperative admin types take care of someone who they really don’t see as their responsibility. I hope my story helps you out.

To start preparing for a recent board, I had to look up some basic information to answer the following:

  • When is my board?
  • How do I know if I’m eligible?
  • When is my PRF due? When does it have to be signed and where does it need to be delivered to?
  • How do I review (and potentially change) my records?

PRF

Before answering these questions I had to write my PRF. Why do IMAs write every word of their PRFs and OPRs? Because IMAs are always shafting their reserve boss because of the demands of our main job and the last thing we want to do is have someone go through the torture of the AF evaluation system when we’ve been so lame.

But nothing is easy. The only time I have to work on this is while I’m flying from DC to Vegas, and I’m on my Mac at 35 kft. I have a draft of last year’s PRF, but it is in XFDL format. My Mac is not just any Mac; it is a government Mac from my day job, so I can’t install any software. Oh yes, this is totally doable, I’m an engineer. Bring it. It turns out the XFDL is base64-encoded gzip. To learn this, I connected to a free cloud-based bash shell VPS (seriously, Cloud9 IDE for the win), looked at the top of the XFDL, and saw:

application/vnd.xfdl;content-encoding="base64-gzip"
H4sIAAAAAAAAC+29eZea2NY4/L+fgjf3eZ+kl0khM3QneRYCKoqAguO6a/ViVBQBGZw+/e8ctGat

So no probs here . . . because I’m on a shell with root, I can use uudeview under Linux to decode the XFDL into a gzipped XML file and then extract it to view in emacs. Happy to explain this in more detail if you email me at tim@theboohers.org. For other questions, I recommend you call the Total Force Service Center at Comm 210-565-0102.

uudeview my_prf.xfdl
mv UNKNOWN.001 my_prf.gz
gunzip my_prf.gz
cat my_prf
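
If you don’t have uudeview handy, the same decoding can be sketched in a few lines of Python. This assumes the file looks like the one above: a single header line announcing base64-gzip, followed by base64 text that decodes to gzip data. The function name decode_xfdl is mine, not part of any tool.

```python
import base64
import gzip


def decode_xfdl(raw: bytes) -> bytes:
    """Return the XML inside a base64-gzip XFDL file (a sketch, not a full parser)."""
    # The first line is the MIME-style header; everything after it is base64 text.
    header, _, body = raw.partition(b"\n")
    if b"base64-gzip" not in header:
        raise ValueError("unexpected XFDL header: %r" % header)
    # b64decode ignores the line breaks in the body by default.
    return gzip.decompress(base64.b64decode(body))
```

To use it, read my_prf.xfdl in binary mode, pass the bytes to decode_xfdl, and write the result out as an XML file.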

What do non-hacker IMAs do? OK, so from here I can parse the XML easily enough to get what I need.

The document to make sure you have in your hip pocket is AFI 36-2406, OFFICER AND ENLISTED EVALUATION SYSTEMS. It is probably the worst-written document possible for quickly finding what you need, but it is the guide for how this is all supposed to work.

When is my board?

According to ARPCM 15-17 CY16 ResAF Board Schedule my board meets on 13-18 Jun. I found this via myPers or https://gum-crm.csd.disa.mil/.

It provides this excellent summary table:

[Screenshot: summary table from the CY16 ResAF Board Schedule]

How do I know if I’m eligible?

The most helpful document was the ARPCM_16-02 CY16 USAFR Lt Col Convening Notice, which I dug around on myPers to get. From this document I found out that, for a Lieutenant Colonel Mandatory Participating Reserve (PR) board, I would need a date of rank (DOR) no later than 30 Sep 10. My DOR is 29 Apr 2010, which fits in the window between the oldest and youngest members for the board:

DAILEY, MELISSA A./30 Sep 10 VANMETER, BRETT A./1 May 02
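
As a sanity check, the window test is a one-line date comparison. The sketch below uses the two dates from the notice and assumes both ends of the window are inclusive, which is my reading rather than anything the notice spells out; the names OLDEST_DOR, YOUNGEST_DOR, and in_window are made up.

```python
from datetime import date

# Dates of rank taken from the two names listed in the convening notice.
OLDEST_DOR = date(2002, 5, 1)
YOUNGEST_DOR = date(2010, 9, 30)


def in_window(dor: date) -> bool:
    """True if a date of rank falls inside the window (ends assumed inclusive)."""
    return OLDEST_DOR <= dor <= YOUNGEST_DOR


my_dor = date(2010, 4, 29)
eligible = in_window(my_dor)
```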

When is my PRF due? When does it have to be signed and where does it need to be delivered to?

From AFI 36-2406, I know that an eligible officer’s senior rater completes the PRF no earlier than 60 days prior to the CSB, which for me is Thursday, April 14, 2016.
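
The date arithmetic is easy to check. This small Python sketch assumes the 60 days are counted back from the first day the board meets (13 Jun 2016); the variable names are mine.

```python
from datetime import date, timedelta

# First day of the board, per the CY16 board schedule.
board_convenes = date(2016, 6, 13)

# Earliest date the senior rater may complete the PRF: 60 days before the board.
earliest_prf = board_convenes - timedelta(days=60)

# date.weekday() counts Monday as 0, so 3 means Thursday.
is_thursday = earliest_prf.weekday() == 3
```

This lands on 14 April 2016, a Thursday, which matches the date above.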

From the table above, I see this confirmed: my senior rater (the USD(P)) has to sign the document between 14 Apr 16 and 29 Apr 16, and I get the completed document by 14 May 16. I can’t find how the PRF gets to the board, so I’m just going to bug the unit admin until I can confirm the document is in.

How do I review (and potentially change) my records?

Check your records on PRDA. So I was missing two OPRs and an MSM. Wow. The key here was working my network and finding the (amazing) admin at ARPC/DPT who had direct access to the records database and was able to update it for me before the board.