Channel: Tracing Enterprise Data Footsteps! ……celebrating the journey of data!

Open IGC: Defining a new bundle!

This post discusses another topic regarding Open IGC: the new “extensibility” API that allows you to define your own objects inside the Information Server repository, and then govern them with Stewardship, Business Terms, or detailed lineage reporting.

First post in this series:
https://dsrealtime.wordpress.com/2015/07/29/open-igc-is-here/

Previous post in this series:
https://dsrealtime.wordpress.com/2015/08/07/open-igc-a-simple-messaging-system-use-case/

So…….you have decided that you want to extend the Information Server repository, and have decided that you want to create your own custom objects with their own icons and their own internal relationships. Now what?

Your first goal is to model what you want or need to represent. What objects do you want to govern? What kinds of lineage do you want to display? See the prior posts in this series for some ideas of what this might mean. Also, be sure to look at the documentation, and play with the real examples for extensibility that are included. [formal documentation for the Open IGC is here: http://www-01.ibm.com/support/docview.wss?uid=swg21699130 ].

Work it out first on paper, or on a whiteboard. What objects do you want a user to be able to click on, and request lineage? What levels in your database schemas do you want to show as connected objects in a lineage graph? If you are illustrating a process, one that has sub-processes and even additional sub-sub-processes, at what level do you want to provide a drill down or “expand” capability to the user for additional detail?

These specifications for your new object types are outlined in a “bundle”. The bundle represents each of the new object types and their icons that you will be defining. The bundle describes the relationships between the objects (parent/child or other “containment” definition) and also captures all of the individual properties (and their data types) for those objects. It establishes formal property names for their use in your code and in the user interface.

The bundle is defined using XML. A well documented xml schema is provided with the Open IGC, and is fairly easy to follow, even if you don’t spend much time with XML. Here is a snippet of the bundle I used for the prior post, to define my “Messaging” environment:

bundleSnippet

Specifically note the class element at the top, named “queue”, and its various properties such as default_persistence towards the bottom. Its parent is “queue_manager” (defined earlier in the xml). Note also the “header section”. These properties will appear towards the top of the detail page when a user is reviewing this object, and also will be shown in the unique “hover” view that is available throughout all of IGC. Properties can be defined as simple strings, integers, floats, etc. and also with enumerated types, as illustrated here. When using an enumerated type, the pre-defined list of values is automatically provided in a drop-down selection for any Steward who might edit this object. The values are also validated when objects are entered via API into the system.

The bundle xml, known as the “asset descriptor”, is arranged alongside two special folders: one for language conventions (not yet fully supported) and one for custom matching icons:

bundleFolderStructure

Your matching icons are placed into an “icon” folder, following a naming convention for their class and size. As documented, the supported icon sizes are 32×32 (big) and 16×16 (small). These different icons will then appear in various places in IGC, depending on the context and what the user is doing.

iconList

Ultimately, the asset_descriptor.xml and the two folders are zipped together into a single archive:

bundleZip

A good practice to follow is to name the .zip file by the name of your bundle.
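For those who would rather script the zip step than do it by hand, here is a minimal sketch in Python. It assumes the descriptor file is named asset_descriptor.xml and sits at the top of a working directory next to the two folders, and that the archive should keep that same layout at its top level. The function name zip_bundle and its parameters are my own invention for illustration, not part of Open IGC.

```python
import zipfile
from pathlib import Path


def zip_bundle(bundle_dir: str, bundle_name: str, out_dir: str = ".") -> Path:
    """Zip everything under bundle_dir into <out_dir>/<bundle_name>.zip.

    Assumption: asset_descriptor.xml and the two folders sit directly
    inside bundle_dir, and should sit at the top level of the archive.
    """
    root = Path(bundle_dir)
    descriptor = root / "asset_descriptor.xml"
    if not descriptor.is_file():
        raise FileNotFoundError(f"missing {descriptor}")

    archive = Path(out_dir) / f"{bundle_name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories, and skip the archive itself in case
            # out_dir happens to be inside bundle_dir.
            if path.is_file() and path.resolve() != archive.resolve():
                # Store paths relative to the bundle root.
                zf.write(path, path.relative_to(root).as_posix())
    return archive
```

Calling zip_bundle("work/Messaging", "Messaging") would then produce Messaging.zip, which follows the naming practice suggested above.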

This is the file that is sent to Information Server to formally “register” your new objects (go to https://:/ibm/iis/igc-rest-explorer ). This can be done programmatically, of course, but the igc-rest-explorer page makes this very convenient, especially when you are first getting started, or if you haven’t done much with REST apis and their invocation as an HTTP based web service. In a later post, I will discuss various ways of making these calls in an automated fashion. Here is a screen shot of how this looks:

register

Click on “bundle” when you first get to the igc-rest-explorer page, and then POST for registering a new bundle. A convenient “browse” button allows you to select your bundle zip file and then just click “Try it Out!”. It is very simple to get started! Error checking is very thorough — if you mess up your bundle, the IGC registration will let you know. Here I have made a very simple error, trying to re-register the same bundle:

registrationError

…it also picks up other subtle errors that you might make when defining your new objects.

When the registration works, you will get a clean confirmation, and can then immediately go and see the results of your creative thinking and design efforts! I like to immediately check the “browse all assets” list, to see what new icons and bundle “section” I have:

MessagingIcons

I also like to immediately select one of my objects in the IGC Query tool, and check to see that my special Open IGC custom properties are showing up as I expect:

igcQueryWithProperties

If you need to make updates to your bundle, such as adding new object types or properties, making mild changes to the labels or names shown in the user interface, or adding or changing icons, there is a REST call (also available at the igc-rest-explorer page) to “Update a previously registered asset bundle”. You cannot make radical structural changes, or alter datatypes or the formal names of registered objects, but simple changes and additions are permitted. If you make changes to your icons, or add new ones, be sure to clear your browser cache to ensure that they are visible the next time you return to and refresh the browse page.

That’s it! Now I am ready to start adding real instances of my new objects to the repository and start governing!!!

Ernie



Open IGC: Uploading New Assets!

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Open IGC: Defining a new bundle!

Original post in this series:
Open IGC is here!

At this point you have your bundle defined, you can see your objects and their icons in the “browse assets” page, and the detailed properties of your new objects are visible within the Query Tool. Congratulations! Now you are ready to start loading new assets, or new “instances” of the objects that you have modeled with your bundle design.

New assets are added using XML and another REST call, this time a special POST for the upload of new assets. The documentation (see URL in the prior blog entry) includes example xml documents and the xsd, but let’s look more closely at one here.

Our Messaging bundle describes a simple hierarchy that has Queue Managers at the highest level, and then Queues. Queue Managers also have “Listeners”. These are the major objects in the bundle. Initially I am just defining new Queue Managers and their Queues. To keep my sanity while initially learning and playing with the API, I am creating a single xml document for each Queue Manager. This is not a requirement, but keeping the documents small, and focused on one higher level object in the hierarchy will help you understand the structure of the xml and speed up your learning curve. Each whole “xml document” or “xml string” is what you will be passing in a single http POST when performing the actual upload.

Here is a list of an initial set of these xml documents.

xmlDocsForPublishing

To stay organized, I keep them in a folder structure, per bundle type, that has a subdirectory for bundle details (see prior blog entry), a subdirectory for publishing new assets, and a subdirectory for publishing “flows” (for lineage…a future post). Ultimately, many of these xml documents will be built “on the fly” in your programs that craft the interface between Information Governance Catalog and whatever you are modeling with your bundle. However, the simplest way to learn the Open IGC is by using static xml documents. Depending on your use case, some of you may only have a few objects to govern, and might always use this file based approach.

highLevelBundleFolderStructure

Let’s take a closer look at one of the publishing xml documents:

publishXML

The elements and attributes above are well documented in the examples, so I won’t go into excruciating detail, but want to point out several items.

1. Your custom properties (light green box). Note how their names each begin with a dollar sign. This uniquely identifies them as “yours”. Every object gets name, short_description, and long_description. Think of these as “free” in your bundle. You didn’t need to define them in the bundle — they are just “there”. As such, they don’t require the dollar sign prefix.

2. The value of the repr attribute in the object header, and the string used for the “value” attribute for “name” immediately below it (purple box) must be identical! This is for internal reasons. It is a requirement of the API. You will get a nice error if they are not identical, so I am pointing it out here to save you the trouble.

3. The ID value (red boxes) is a unique identifier for the asset within this xml document. It is just an internal reference that is used throughout this particular xml document (it doesn’t have any overall system significance). It is critical for establishing the hierarchy of your objects and will be even more important when you learn about the “flow” xml for lineage.

4. The “reference” element (blue box) is what helps establish the hierarchy, identifying the parent asset (if applicable). Note the use of “ID”.
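To catch slips on items 2 through 4 before you upload, a small pre-flight check can help. The sketch below is in Python and runs against a hand-made sample document. Please treat the sample as an assumption: the element and attribute names (asset, attribute, reference, repr, ID, assetIDs) follow the description above, but the wrapper elements, the missing namespace, and all the values are invented for illustration. Check them against the xsd that ships with the documentation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real layout is defined by the Open IGC xsd.
SAMPLE = """\
<doc>
  <assets>
    <asset class="$Messaging-queue_manager" repr="QM_EAST" ID="a1">
      <attribute name="name" value="QM_EAST"/>
      <attribute name="short_description" value="East coast queue manager"/>
    </asset>
    <asset class="$Messaging-queue" repr="ORDERS.IN" ID="a2">
      <attribute name="name" value="ORDERS.IN"/>
      <attribute name="$default_persistence" value="yes"/>
      <reference name="queue_manager" assetIDs="a1"/>
    </asset>
  </assets>
  <importAction partialAssetIDs="a1"/>
</doc>
"""


def _split_ids(value):
    """Split an ID list on commas and/or whitespace, dropping empties."""
    return [tok for tok in re.split(r"[,\s]+", value or "") if tok]


def check_publish_doc(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    assets = root.findall(".//asset")
    known_ids = {asset.get("ID") for asset in assets}
    problems = []
    for asset in assets:
        asset_id = asset.get("ID")
        # Item 2: repr must be identical to the value of the "name" attribute.
        names = [a.get("value") for a in asset.findall("attribute")
                 if a.get("name") == "name"]
        if names != [asset.get("repr")]:
            problems.append(f"{asset_id}: repr does not match the name value")
        # Items 3 and 4: every reference must point at an ID in this document.
        for ref in asset.findall("reference"):
            for target in _split_ids(ref.get("assetIDs")):
                if target not in known_ids:
                    problems.append(f"{asset_id}: reference to unknown ID {target}")
    for action in root.findall(".//importAction"):
        for attr in ("partialAssetIDs", "completeAssetIDs"):
            for target in _split_ids(action.get(attr)):
                if target not in known_ids:
                    problems.append(f"importAction: unknown ID {target}")
    return problems


print(check_publish_doc(SAMPLE))
```

This only checks the cross-references inside one document. The API itself remains the final judge of whether the document is valid.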

Another very important part of the publishing xml is the “importAction” element at the bottom of your xml document. This is an important property that controls the behavior of the API when managing a complex hierarchy. This can be a difficult concept to understand, but I will do my best to explain it here.

partialAssetID

The element importAction has two attributes, partialAssetIDs and completeAssetIDs. These attributes contain a set of comma delimited IDs from up above in the xml document. They describe whether a particular asset, in this xml document, is being uploaded with ALL of its children, or only “some”. If the parent ID is listed in “completeAssetIDs”, then the parent and its collection of child objects is considered complete; any pre-existing child instances will be blown away. If you want to preserve the pre-existing children for a particular parent, place its ID in “partialAssetIDs”.
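If it helps, here is the same rule as a tiny Python model. This is only my reading of the behavior described above, written as a toy. It is not how IGC is implemented, and the function and queue names are made up.

```python
def children_after_upload(existing, uploaded, parent_is_complete):
    """Toy model of which child assets a parent has after an upload.

    existing           -- children already in the repository
    uploaded           -- children listed in this xml document
    parent_is_complete -- True if the parent ID is in completeAssetIDs,
                          False if it is in partialAssetIDs
    """
    if parent_is_complete:
        # The uploaded set is treated as the whole truth:
        # pre-existing children that are not in it are removed.
        return set(uploaded)
    # Partial: pre-existing children are preserved alongside the new ones.
    return set(existing) | set(uploaded)


existing = {"ORDERS.IN", "ORDERS.OUT"}
uploaded = {"ORDERS.IN", "AUDIT.LOG"}
print(sorted(children_after_upload(existing, uploaded, True)))
print(sorted(children_after_upload(existing, uploaded, False)))
```

With the parent listed as complete, ORDERS.OUT disappears. With it listed as partial, all three queues remain.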

Once you have built your xml document, and have checked it for well-formed-ness (at the very least, make sure you can open it in your browser as a well-formed and recognized xml document), you are ready to upload it to IGC. Go to the igc-rest-explorer page for the Open IGC API and find the bundle “POST” invocation for publishing assets:

publishAssets
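As an alternative to opening the document in a browser, the well-formedness check mentioned above can be scripted. This is a minimal sketch using Python's standard xml.etree.ElementTree parser. The function name is my own, and it checks well-formedness only, not validity against the Open IGC xsd.

```python
import xml.etree.ElementTree as ET


def xml_parse_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


print(xml_parse_error("<doc><assets/></doc>"))        # well-formed
print(xml_parse_error("<doc><assets></doc>") is None)  # mismatched tag
```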

Open your xml document in a regular editor and copy/paste the entire xml string into the available property (red box in the screen shot above) and then click “Try it out!”

If there are any errors, you will receive them here directly, and if all is “ok”, you will receive a clean 200 response code, and your assets will have been loaded.

successReturnCode

At this point, you can immediately return to the Information Governance Catalog and view your new assets!

NewAssets

Browse them by returning to the main “Information Assets…Browse All” pull down where you found the icons for your bundle, and then look around….see if your child assets are also loaded, and how they are displayed “within” the parent! Try doing a Query. Edit one of your new assets and make adjustments to one of the properties!

Your assets are now being governed…they can be assigned Terms and Labels, belong to Collections, become the responsibility of a Steward — everything that you can do within Information Governance Catalog is now available for your new objects! In the next post we will look at how you can apply your own custom flow definitions for data lineage that includes your new object instances.

–ernie


Defining Lineage Flows (Part 1)

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Uploading New Assets!

Original post in this series:
Open IGC is here!

If you have been following along, we finished designing our first simple “bundle” and uploaded some new instances of those objects. Our use case for that first bundle was a not-so-standard source and target; a set of “message queues”. The objects we created for that use case “might” contain their own lineage, but it is more likely that they will simply participate as end-points in other lineage definitions. Once created, they can be referenced by Extension Mappings or by the new Open IGC.

Now let’s get into the use case for a true “data mover” by defining a set of objects that actually move and transform data — programs, scripts, stored procedures, independent ETL tools, java, etc. Open IGC provides constructs that allow us to define, and therefore graphically illustrate the processes and sub-processes that we use for transformation. Further, it allows us to describe internal and external flows, to establish a “zoom” point where we can “dive in” for more detail (“Expand” for those of you who use DataStage lineage today), and to specify both Design and Operational level lineage. There are other goodies too, so let’s get right to it.

For this exercise I have defined a new bundle. I call it “PixelStage”. This is a fictitious ETL tool from the future that moves and transforms light beams. ;) I took the liberty of using this example for the object types and their properties to force me to think “outside the box” and frankly, to keep things light (no pun intended) and interesting. Ultimately I morphed back to a fairly normal “column lineage” and data oriented paradigm, but this approach helped with the early learning curve over six months ago. You already know how to construct a bundle, so I will cover the highlights of what makes this bundle just a “little bit” different from our Messaging one.

First we define our bundle ID, and then lay out a hierarchy of object types. The “Processes” we are defining belong to a “Workspace”, and then inside each “Process” we will be defining a set of “Tasks”. By analogy to DataStage, this is like Project, Job, and Stage. Many programming disciplines have similar structures (albeit deeper or shallower) that you can describe in this fashion. Beneath “Tasks” will be “Columns”, the lowest unit of data flow for our lineage definition. Here is the screen shot of our new family of objects in this bundle:

(click on any image in this post to enlarge it in its own window; use your “back” key to return to the post)…

pixelStageFamily

Looking more closely at this bundle, here are some very interesting and important properties:

expandableInLineage="true"

expandable

This defines the “summary” level that you want to initially appear by default in your lineage reports. One object in each independent hierarchical path can have this property. Here we are defining expandableInLineage on each “Process”. This means that the user, upon seeing lineage initially displayed at the “Process” level, can drill deeper by clicking on the “Expand” link and see lineage that is “inside” that Process…at its underlying “Task” level:

expandLink

expandedProcess

While this diagram looks a little bit like an expanded DataStage Job, you can quickly see that some of the icons are unique (I stole the others from DataStage for this example because I didn’t have time to play in MS Paint!). Each icon inside the Process is identifying a different “Task” in this bundle, and each with its own internal lineage showing flows from one Task to another. The user can then hover over and further examine a Task, and then request lineage on one of its columns:

columnLineage

So you can see how Open IGC lets us define, and then explore, very fine-grained lineage patterns.

Another interesting property, especially when defining lineage for data movement tools that have their own graphical development paradigm, is canHaveImage="true". This is a nice feature that allows an IGC metadata author to edit the object and include a static screen shot for better identification and governance purposes.

The subprocesses for any transformation tool or process that you describe will often have different purposes; different functions that they apply. In our use case they are all still called “Tasks”, as they each belong to an overall “Process”, but each has its own unique properties. Open IGC allows us to reflect this relationship in the bundle, simplifying our definitions by supporting the inheritance of common properties. Here we see the overall definition of a class called “Task”, with some Header properties that will be common to all Tasks I define:

mainTask

As I define additional task types and their custom properties, I refer back to the overall “Task” definition using the “superClass” attribute:

superClass01

Class “Converter” above inherits Header properties from object “Task”, but further defines its own (inWaveLength and outWaveLength) and we see this again in the Reader subclass that has properties to keep track of security credentials:

superClass02
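The inheritance idea can be pictured with a few lines of Python. This is a conceptual sketch only: the dictionary below is not the bundle xml, and the property names on Task and Reader (owner, last_modified, user_name) are hypothetical stand-ins for whatever your own bundle defines. Only inWaveLength and outWaveLength come from the Converter class described above.

```python
# Conceptual stand-in for the class definitions in a bundle.
CLASSES = {
    "Task": {"superClass": None,
             "properties": ["owner", "last_modified"]},      # hypothetical
    "Converter": {"superClass": "Task",
                  "properties": ["inWaveLength", "outWaveLength"]},
    "Reader": {"superClass": "Task",
               "properties": ["user_name"]},                  # hypothetical
}


def all_properties(class_name, classes=CLASSES):
    """Collect properties from the top of the superClass chain downwards."""
    chain = []
    current = class_name
    while current is not None:
        chain.append(current)
        current = classes[current]["superClass"]
    props = []
    for name in reversed(chain):  # inherited properties come first
        props.extend(classes[name]["properties"])
    return props


print(all_properties("Converter"))
```

Each subclass ends up with the common Task properties plus its own, without repeating the common ones in every definition.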

While we are here it is worth noting that there are often objects in a bundle that you might not want to have ANY definition for lineage. These are objects that you still want to govern, and provide icons for, but without allowing the user to ask for “Data Lineage”. In this example, I want to illustrate “Variables” used by a Process. We may want to represent any number of them, and have them appear in lists, with their own icons, and available for independent reporting, but not be something that directly participates in a “flow”. Note the attribute called dataAccessRole="none". This indicates that the object cannot be directly defined for lineage, and the icon that a user clicks to request lineage will not appear for this object in a hover window or on its detail page.

dataAccessRole

The variables still appear in a Process detail page, but don’t illustrate lineage themselves:

dataAccessRole

variable

Whew. This post is getting long. Next time we will see how we get all of these objects connected to one another and to other assets in the enterprise.

–ernie


Defining Lineage Flows (Part 2)

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Defining Lineage Flows (Part 1)

Original post in this series:
Open IGC is here!

Now it is time to start connecting your assets, processes and objects together to complete your illustration of lineage.  In our previous discussion we reviewed a bundle that describes a data movement and transformation process — complete with inner functionality and flows that are consistent with common ETL and programming patterns.   Of course, once you have defined this new bundle, you need to upload instances for it (instances of the objects that represent your actual programs, processes, and sub-processes).  We described this effort in an earlier post [https://dsrealtime.wordpress.com/2015/08/20/open-igc-uploading-new-assets/ ].   Once our instances are loaded, we will want to describe for the Open IGC exactly how those inner flows are tied to each other, and how they link to other enterprise assets that define our sources and targets.    Ultimately, the chaining of sources, targets and processes, along with all the other lineage definitions already captured or known to the Information Governance Catalog (DataStage, QualityStage, Extension Mappings, lineage via SQL views, business intelligence tools, FastTrack, etc.) will give us a complete end-to-end view of the lineage for the entire data integration lifecycle.

Lineage via Open IGC is defined by the ingestion of a “flow” document.   Like the import of new assets, this is an xml document, defined in a single REST call to Open IGC.   This document first lists the inventory of assets that will be used in lineage definitions, and then defines the exact source and target specifications (what is connected to what) that represent the flow of data.   Let’s first look at the list of assets.   Here we see a snippet of the flow doc that provides an inventory of our “bundle” assets.    Each asset node identifies one “instance” of an asset that will be used in a lineage flow, or else identifies the parent of an asset within its hierarchy so that it can be properly located by Open IGC: (click directly on the image for a larger view and then use your “back” button to return)

assetListForFlows

The asset ID="w1" (red box) above is an arbitrary value, hand coded here but usually generated programmatically. As with the uploading of new asset instances, this value is pertinent only for this xml document in this invocation of a REST call to Open IGC. It is not a persistent value connected to this resource. The purple boxes identify the critical parts of the hierarchy leading to the $PixelStage-Column that will be formally used as a source or target.

Further down in our asset inventory we see the identification for the Database Columns of a Database Table.  Like our newly loaded bundle assets, Database Columns belong to a hierarchy, each level of which is properly identified.  Notice that I don’t need to provide any detailed properties here and in the example above…just the identity information (name in this case) and the containment relation for its parent.

assetListDatabaseForFlows

Once again, asset ID = “db1” is arbitrary and unique only for this xml.  The purple boxes identify the hierarchy that leads to Database Column “mycol1”.   This should be familiar to you when you review the hierarchy of any Implementation Model with the Information Governance Catalog browser interface.   We are simply identifying each part of the “tree”.

Similarly, here is the identification for the Data File Fields of a Data File.  I don’t have to go down to this level, but it is a best practice to define lineage at the lowest possible point in the hierarchy, which is generally columns and fields.    At the very least, aim to define lineage at the table level — lineage results will be more clear for your end users.

assetListDataFileforFlows

This identification of objects needs to be done for anything that you want to include in your lineage path. Data Files and Database Table assets are often the most common, but any object that is available for lineage in IGC is a potential candidate: Business Intelligence assets, members of an Extended Data Source collection, or parts of other bundles, such as the Messaging objects we reviewed in an earlier post of this series. How do you figure out that hierarchy, and learn the object class names? Well….admittedly, that can be tricky, although once you deal with the most common ones for a while, you will become familiar with the names and their relationships. It is important that you become familiar with all the tooling that is available at the igc-rest-explorer page that we have reviewed in earlier posts. The “Types” section and the “Assets” section are invaluable for reviewing the class names for primary objects and their properties…and Open IGC will be sure to remind you with useful errors about not finding an asset if you spell a class incorrectly or guess wrong on the hierarchy.

After we have identified our “inventory of assets” we are ready to connect them.  Here we bring our attention to the “flowUnit” nodes of the xml document.  Each “flowUnit” is associated with an asset (usually a higher level asset, such as a whole Job or Process) and has a collection of individual “flows” that are the detailed unit for a point-to-point source/target specification.    Let’s look at a representative sample and identify some of its meaningful parts:

flowUnit

The first important attribute in the flowUnit xml element is assetID="p1". This is the main asset that is associated with this flow unit. This refers to the in-document assetID that is associated with each node up above in our asset inventory (in the initial screen shot above in this post, assetID="p3" describes a Process called lookupCustomer and would be the typical asset for flowUnit details). The value “p1” identifies a whole process object in our bundle hierarchy. An entire “Job”. This also might be a single “instance” of an object that references a formal “execution” or “run” of this Job, if such an object is defined in your bundle. In this scenario, the next interesting attribute, flowType="DESIGN", provides a “descriptor” for the kind of lineage I am defining. This value will appear for the user when they use their mouse and “hover” over a particular line/arrow in a lineage diagram. “DESIGN” represents the “intended” lineage for this process, and perhaps might be a way to show the process's own default values as coded by a developer. “DESIGN” might not be needed for your use case; many times you might only need “OPERATIONAL” for the flowType, when the lineage you are defining reflects an actual run-time history of the process and the data that flowed through it.

Now look carefully at the “flow” element above. Very simple. It has two critical xml attributes. One for source IDs…and one for targets. These point to other assetIDs from the asset inventory you defined above. This is the ultimate key to defining your lineage. Lay out your point-to-point lineage connections here. IGC will aggregate and summarize your low level lineage specifications to display a larger lineage rendering.
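One way to picture that aggregation is to lift each column-level connection up to the parents of its two endpoints. The Python sketch below does exactly that and nothing more. It is an illustration of the idea, not IGC's actual algorithm, and every asset name in it is invented.

```python
def roll_up(flows, parent_of):
    """Lift each (source, target) pair to the parents of its endpoints."""
    edges = set()
    for source, target in flows:
        src_parent = parent_of.get(source, source)
        tgt_parent = parent_of.get(target, target)
        if src_parent != tgt_parent:  # ignore flows inside one parent
            edges.add((src_parent, tgt_parent))
    return sorted(edges)


# Hypothetical containment: column or field -> its table, task, or file.
parent_of = {
    "customers.mycol1": "customers",
    "customers.mycol2": "customers",
    "reader.c1": "reader",
    "reader.c2": "reader",
    "out.csv.f1": "out.csv",
}

# Point-to-point flows at the lowest level of the hierarchy.
flows = [
    ("customers.mycol1", "reader.c1"),
    ("customers.mycol2", "reader.c2"),
    ("reader.c1", "out.csv.f1"),
]

print(roll_up(flows, parent_of))
```

Three column-level flows collapse into two summary connections, which is why defining lineage at the lowest level loses nothing at the higher levels.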

Once the lineage specifications have been outlined, the xml is uploaded via Open IGC using the POST that is available in the Flows section of the igc-rest-explorer.

flowsCall

As with other call samples on the igc-rest-explorer “learn and test” page, there is a property where you can paste your xml payload, and then an example of the formal URL and expected response that you will use within the formal Open IGC interface you are developing.

If all goes well, your flow xml will be uploaded successfully and you can view lineage for your defined process! Lineage can be invoked in many ways — when initially testing lineage for new process assets I try to start lineage from the overall “process” asset itself. This will generally show me all of the lineage connections that were defined in that xml submission. Then you can move on and validate other lineage connections, starting on various assets that are significant for your use case.

expandLink

…and drill in deeper with “Expand” if you have enabled that capability in your bundle!

expandedProcess

…and as noted earlier, you can optionally hover over an individual lineage “arrow” and see the flowType for that particular data flow connection.

With this post, you now have reviewed all the basic ingredients you need to (1) design and register a new “bundle” of custom assets, (2) load various assets that you want to govern and make available for lineage, and (3) define and render the lineage that illustrates the flow of data through your systems.

In the next post we will start looking at advanced topics for fine tuning your lineage displays, updating bundles, etc.

–ernie


Open IGC Advanced Topics: Virtual Assets

This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Defining Lineage Flows (Part 2)

Original post in this series:
Open IGC is here!

In the previous post, we reviewed how you define a formal lineage “flow” — first by defining the “inventory” of assets that will be established as sources and targets in your exact flow specification, and then the flowUnit xml node that explicitly states what will be the “source” and “target” for each point to point connection.

Assets, as we reviewed, might be Data Files, Database Columns, objects from other Bundles, or anything else that is able to participate in a lineage report within the Information Governance Catalog. We looked at how you define the hierarchy, identifying (for example) the Host, Database, Schema, Table and specific column name for a Database Column that has been formally imported into the repository (at an earlier time, via Metadata Asset Manager or other mechanism).

(as a reminder, here is the hierarchy that identifies a Database Column to be used in a flowUnit)

assetListDatabaseForFlows

(and here is a flow unit that includes that column)

flowElement_y1

But what happens if you haven’t yet imported that Database Table and its columns? What if this is a temporary table, with a dynamic, time-stamp generated name, and you don’t care about ever formally importing it into the repository for governance purposes? What if you simply made a typo in your code, picked up the wrong name from somewhere in your program, or were given mis-information by the tool whose lineage you are recording? Open IGC supports the idea of a “Virtual Asset”. This allows you to define the objects that will be seen in a lineage report as a source or a target, but without any concern about whether they actually exist in the repository. These assets appear in the lineage diagram, but will be slightly greyed out, to indicate their status as a “Virtual Asset”.

In the first screen shot above, Database Column mycol1 doesn’t really exist. I have never imported it. It is used for illustration purposes, but could also easily be a column in a temporary table that only exists for a given run of the application. Note that it still appears in the lineage report, but with a slightly greyed out appearance. All the details from the definition above will appear in the report…the “red” box in the screen shot below (the top source icon on the left) identifies the “Virtual Asset”:

virtAssetRealAsset

This Virtual Asset is viewed here in lineage, and you can even click on it directly and go to its detail page. However, it is considered “non governable”. This means that you can’t assign Terms or Stewards to it, or use it in a Collection, or anything else related to governance. It is a tool to assist you in enabling lineage, providing additional insight where needed regarding the flow of your data. If it is truly an important asset, then it makes sense to formally import it and give it a full definition in the flow xml.

If an asset is found (using its name based object identity) in the repository, then it appears in a clear font and fully colored icon, without being greyed out. The green box (the bottom icon on the left in the lineage picture) identifies a “real” asset. This is an asset that truly exists in the repository that was imported earlier by formal means, and is fully governable (searchable, can be assigned Terms and Stewards, etc.).

Virtual Assets can be created for native IGC objects or for assets that you have created with your bundles. They are a powerful mechanism for illustrating lineage quickly and simply, without worrying about whether metadata has been formally imported or defined elsewhere. Later on, if metadata is imported and matches your flow XML, the Virtual Asset will become “real” in each lineage report.
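Here is the real-versus-virtual rule as a toy model in Python. It assumes that identity is simply the full name-based path (Host, Database, Schema, Table, Column) and that the repository can be pictured as a set of such paths. Both are simplifications of mine, and the host and database names are invented.

```python
def classify(identity, repository):
    """Return "real" if the name-based identity is in the repository."""
    return "real" if identity in repository else "virtual"


# Identity of a Database Column: (host, database, schema, table, column).
mycol1 = ("HOST1", "SALESDB", "APP", "CUSTOMERS", "mycol1")

repository = set()                   # nothing imported yet
print(classify(mycol1, repository))  # shown greyed out in lineage

repository.add(mycol1)               # metadata imported later on
print(classify(mycol1, repository))  # same flow xml, now a governable asset
```

The flow xml does not change between the two calls. Only the contents of the repository do.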

Virtual Assets allow you to illustrate objects in lineage that don’t require governance, but need to be shown so that users fully understand the big picture for your overall data flows. They enable you to more quickly get your lineage solutions up and running for all IGC users.

–ernie


DataStage on YARN! …running in Hadoop!


Hi all…

Just a quick note. Yesterday we announced Information Server 11.5. It has some new features for governance, such as support for XML and also for detailed data classifications…..and it also has the ability for DataStage Jobs to run in Hadoop, controlled by YARN!

One of my colleagues with deep experience with Hadoop has written a very nice post on this exciting new capability… http://bit.ly/1Kgk2Lg

Be the first to start using this feature to take additional advantage of your enterprise’s investment in Hadoop!

Ernie


Updating Your Bundle


This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Open IGC Advanced Topics: Virtual Assets

Original post in this series:
Open IGC is here!

So….by now you have a custom bundle, loaded with instances of new objects for governance. You can see them in the Query tool of IGC, and view their icons when browsing all information assets. Hopefully you have tried to assign them to Business Terms and give them Stewards, and maybe even given them additional meaning by including them in a lineage flow.

Equally important, I hope that you have shown them to your colleagues and other members of your governance team, and also many of the general business and technical users in your enterprise! How did they like it? Have you illustrated a concept critical to your business that they can follow? Do they understand this concept better than they did before? Did they have any questions? It is very important that you expose your new bundle and its purpose to your entire user community. Their feedback will be critical as you fine tune the solution and make it a regular part of your governance activities.

If you have done all of these things, then it is likely that you need to make adjustments and changes. Maybe the labels for your new objects aren’t descriptive enough. Maybe you made a typo. Perhaps you need some alternative structures, or want to tweak the behavior of an object when it is used within a lineage report. How do you apply changes to a bundle? You could just delete the whole thing and start over, registering the bundle again, but that might not be necessary. What if you have already loaded a lot of assets, and lineage flows — it would be frustrating to have to run those calls again or manually re-load all of that metadata. There are some changes you can safely make to a bundle without receiving any errors:

— Add a new class
— Add a new property to a class
— Change label names
— Change default locale label
— Change label properties files
— Change dataAccessRole, expandableInLineage
— Add or change icons
— Add another literal to an existing enumeration (for an object declared as having an enumerated list).

What you cannot change (requires deletion and re-registering of your bundle):

— rename a class or attribute
— change a datatype
— remove a class or attribute
— change containment (change parentage)
— change inheritance (superclass)
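As a rough way to reason about the two lists above, here is a minimal sketch that compares two versions of a bundle and reports the changes that would force a delete and re-register. It does not read a real asset_type_descriptor.xml; the bundle is modeled as a plain dictionary, and the class names (Queue, QueueManager, Topic) and datatypes are hypothetical. Labels, icons, dataAccessRole and expandableInLineage are not modeled because changing them is always allowed.

```python
import copy


def breaking_changes(old, new):
    """Return the changes that cannot be applied with an in-place bundle update."""
    problems = []
    for cls, old_def in old.items():
        new_def = new.get(cls)
        if new_def is None:
            # A rename is indistinguishable from a removal here.
            problems.append(f"class removed or renamed: {cls}")
            continue
        for key in ("parent", "superclass"):
            if old_def.get(key) != new_def.get(key):
                problems.append(f"{key} changed for class {cls}")
        for attr, dtype in old_def["attributes"].items():
            new_dtype = new_def["attributes"].get(attr)
            if new_dtype is None:
                problems.append(f"attribute removed or renamed: {cls}.{attr}")
            elif new_dtype != dtype:
                problems.append(f"datatype changed: {cls}.{attr}")
    return problems


# Hypothetical bundle model.
v1 = {
    "Queue": {"parent": "QueueManager", "superclass": None,
              "attributes": {"name": "String", "depth": "Integer"}},
}

# Allowed: a new class and a new property.
v2_safe = copy.deepcopy(v1)
v2_safe["Topic"] = {"parent": "QueueManager", "superclass": None, "attributes": {}}
v2_safe["Queue"]["attributes"]["owner"] = "String"

# Not allowed: a datatype change and a containment change.
v2_unsafe = copy.deepcopy(v1)
v2_unsafe["Queue"]["attributes"]["depth"] = "String"
v2_unsafe["Queue"]["parent"] = "Broker"

print(breaking_changes(v1, v2_safe))
print(breaking_changes(v1, v2_unsafe))
```

An empty list means the change set stays within the allowed list; anything else means delete, re-register, and re-load.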

Make and then save your changes to your original asset_type_descriptor.xml, and then zip up the asset_type_descriptor.xml, your icon subdirectory and the language subdirectory. Apply the updates using the PUT call that is available at the igc-rest-explorer page:


updateBundle

If your changes are in the “allowed list” above, and there aren’t any other errors, your update will be applied successfully and you can immediately see the impact on your existing object instances. If your changes are not in the allowed list, you will need to entirely delete your bundle, apply the changed bundle, and then re-load the instances.
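If you update your bundle often, the zip step described above is easy to script. The following is a minimal sketch using only the Python standard library. The subdirectory names "icons" and "i18n" are assumptions; pass whatever names your own bundle uses. The function only builds the zip file, which you then supply to the PUT call at the igc-rest-explorer page.

```python
import zipfile
from pathlib import Path


def zip_bundle(bundle_dir, zip_path, subdirs=("icons", "i18n")):
    """Zip the descriptor plus the icon and language subdirectories.

    The subdirectory names are assumptions; adjust them to match your bundle.
    """
    bundle_dir = Path(bundle_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # The descriptor sits at the top level of the archive.
        zf.write(bundle_dir / "asset_type_descriptor.xml", "asset_type_descriptor.xml")
        for sub in subdirs:
            sub_path = bundle_dir / sub
            if not sub_path.is_dir():
                continue
            for f in sorted(sub_path.rglob("*")):
                if f.is_file():
                    # Keep paths relative to the bundle root inside the archive.
                    zf.write(f, f.relative_to(bundle_dir).as_posix())
    return Path(zip_path)
```

A call such as zip_bundle("my_bundle", "my_bundle.zip") (both names hypothetical) produces an archive with the descriptor at the top level and the two subdirectories beneath it.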

Happy “bundling!”

–ernie


Sample Bundles


This entry is one of many in a series that describes the InfoSphere Open IGC API, which allows you to define your own objects for information governance using InfoSphere Information Server and the Information Governance Catalog.

Previous post in this series:
Updating Your Bundle

Original post in this series:
Open IGC is here!

Here are some sample bundles for you! These bundles correspond to the use cases that I have been describing within this blog series (see “Original post” above). Each .zip file contains a directory structure that is formatted as I described in one of my early posts on bundle design (Open IGC: Defining a new bundle!). These bundles are for demonstration and learning purposes only. There are no warranties or certified methodologies implied.

Each bundle is complete with the asset_type_descriptor, along with several instance publishing upload files and one or more flow model uploads (if applicable to the use case). I have tried to include examples of various techniques, some of which I have already reviewed in these posts, or intend to in the near future. The values for various string properties are fictitious, and in some cases, just repeated and copied in the interest of more quickly building the example. This is especially true with the asset_ids (attribute ID= in the publishing and flow upload xmls), whose values are fairly random. These xml documents were crafted by hand — a good way to start testing — but ultimately, most of you will probably generate these unique identifiers programmatically. The prior posts in this series are enough to help you take these examples, register their bundles and upload their assets and lineage specifications. Then you can play with the instances within IGC, add new ones, update property values via the user interface or with new xml’s, and get further inspired to build your own!
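For those who want to move from hand-crafted identifiers to generated ones, here is one possible approach, offered as a sketch only. It derives a stable identifier from the class name and the names that identify the asset, so the same asset always receives the same ID across runs. The function name, the hashing scheme, and the example names are all hypothetical; any scheme that yields unique values within a document would do.

```python
import hashlib


def asset_id(asset_class, *identity_path):
    """Build a repeatable ID from the class name and the asset's identifying names."""
    # The unit separator is unlikely to appear in real names, so joined keys stay unambiguous.
    key = "\x1f".join((asset_class,) + identity_path)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # A leading letter keeps the value usable where IDs may not start with a digit.
    return "a" + digest[:12]


queue_id = asset_id("Queue", "QM1", "ORDERS.IN")
same_queue_id = asset_id("Queue", "QM1", "ORDERS.IN")
other_queue_id = asset_id("Queue", "QM1", "ORDERS.OUT")
print(queue_id, other_queue_id)
```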

Let me know if you have any problems accessing these zip files, or if you have any further questions about their use. Also let me know if you would like to share your own creative bundles!

–ernie

Note: This site doesn’t allow me to upload .zip files, so the files at these url’s have been renamed with “.ppt” as an additional suffix. Just rename them after download. They are normal .zip files.

Messaging Use Case

Messaging_bundle_and_content.zip

Abstract “Access Control” Use Case

accessControl_bundle_and_content.zip

Transformation Tool Use Case

PixelStage_bundle_and_content.zip
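If you download several of these files, the rename mentioned in the note above can be scripted. This is a small sketch using the Python standard library; the function name is hypothetical. It strips only the trailing “.ppt” and then checks that the result really is a zip archive.

```python
import zipfile
from pathlib import Path


def restore_zip_name(downloaded):
    """Drop the extra '.ppt' suffix from a downloaded bundle and verify it is a zip."""
    path = Path(downloaded)
    if path.suffix.lower() != ".ppt":
        return path  # nothing to rename
    target = path.with_suffix("")  # 'x.zip.ppt' becomes 'x.zip'
    path.rename(target)
    if not zipfile.is_zipfile(target):
        raise ValueError(f"{target} is not a zip archive")
    return target
```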



Apache Atlas…a Common Metadata Initiative with “legs” ?


Lately there have been increasing discussions about Apache Atlas (http://atlas.incubator.apache.org), an open source initiative for metadata and governance services.

Standards in the technology industry come and go. Some make it and enjoy wide adoption; others do not, failing early or never really blossoming to their full potential. Our industry is littered with examples that had promise but withered away because vendors were unable to agree on common semantics or unwilling to let go of (or expose) proprietary intellectual material. Meta models represent a significant investment, and often competitive leadership. No one wants to yield hard earned territory, or potentially give away the “golden key” to their solution. Standards like xmi, cwm, and others in the data integration and business intelligence space never fully delivered the nirvana that people hoped for. They lacked the commitment, weren’t pushed hard enough by customers writing the checks, and capability wise, were typically considered by many as nothing more than a “checkbox” requirement. Certainly, competing vendors in niche data integration areas couldn’t stomach having their meta models shared interchangeably.

The climate for this is changing now, thanks to big data and open source, and trends such as the adoption of Hadoop in everyone’s sandbox (even if not in production). Not participating, or flat-out ignoring open source, is no longer acceptable. Being “open” is no longer a vendor liability, but a competitive advantage. Not being open is a path to extinction. For these reasons and more, Apache Atlas is poised to be a major force in the drive for common information governance and metadata management.

Please take the time to read the blogs from two of our highly respected colleagues here at IBM (IBM Fellow Tim Vincent and Distinguished Engineer Mandy Chessell) regarding Apache Atlas and what it will mean for our industry:

insightout-metadata-and-governance

insightout-case-open-metadata-and-governance

That’s all wonderful news on the potential for Apache Atlas. What does it mean for the InfoSphere Information Governance Catalog (IGC)?

Along with other contributors from the vendor and user community, IBM is committed to the success of Apache Atlas.   Although still early in its incubator status at Apache, Atlas is already being implemented at customer sites for their hadoop based assets.   And while Atlas is not specifically limited to hadoop, today this is the primary domain where it plays and will mature.

In the meantime, Information Server customers using IGC want to use Apache Atlas to help federate the metadata in their hadoop distributions so that it participates in their enterprise governance ecosystem. Atlas shows evidence of eventually supporting distributed and clustered configurations, but sites are looking to do this right now, by bringing Atlas metadata directly into the Catalog. OpenIGC, the methodology and API for extending the IGC repository, makes this possible today. Several customers, as well as IBM, are looking into how the two can be integrated. Each technology supports a robust REST API, and describes similar constructs that can be illustrated in each, either directly or by extending the underlying default models. Pulling Atlas metadata into IGC allows it to immediately participate in data lineage reporting, be assigned to subject matter experts and Stewards, be related to data quality statistics, and be connected to approved policies for data management and governance. Sites can immediately reap the benefits of IGC in combination with their hadoop based Atlas investments, while still looking to the future and the benefits that Atlas holds for even deeper governance capabilities and participation by a vast number of vendors and technology owners.

Lots to do, and lots to keep track of! But there are many things that can be done “right now” to take advantage of, and garner insight into, the future. Stay tuned. Atlas is moving towards becoming a standard with legs we can all stand on….

Ernie


Evolving Atlas…


Apache Atlas is continuing to evolve, and quite quickly (see an earlier post about Atlas, including links to this open source initiative and other valuable commentary… Apache Atlas…a Common Metadata Initiative with “legs” ?).    Going beyond merely storage and process-based metadata, the Apache project is poised to introduce the ability to define a business taxonomy that increases common understanding and further defines assets across the enterprise.  The important inclusion of business vocabularies ensures that information governance incorporates the needs of ALL members of an organization, and not just IT.

As Apache Atlas takes on greater roles and open source accelerates its uptake, we can foresee a future where Atlas is called upon whenever and wherever data is accessed. In her latest blog, Mandy Chessell floats the idea of a Connector Framework for Apache Atlas [http://www.ibmbigdatahub.com/blog/insightout-role-apache-atlas-open-metadata-ecosystem]. Connectors of all kinds can access Atlas at the exact moment that they harvest or act upon data, with the ability to make decisions using everything that Atlas has to offer: ownership, location, data quality statistics, lineage, usage requirements and rules, and more. This allows Apache Atlas to be more “intimate” with the data integration life-cycle and able to deliver governance rules that have real “teeth”.   –ernie.


IBM Partners with Creative Solutions Using Open IGC !


Many of you come to these pages to understand how to extend the Information Server repository and use the various Information Governance Catalog APIs to enhance your users’ experiences and increase your governance capabilities.   But for some of you, there are too many interfaces, not enough time, not enough resources (or the right skilled resources) to complete the effort.   Please let me introduce you to various trusted IBM partners who have been trained on, and are using,  Open IGC and related techniques to help customers around the world reach their information governance goals.  Many of these partners have built formal “bridges” from various 3rd party tools, to automate the metadata import process, and most of them also offer expert consulting on IGC and governance strategies in general.

To our partners…thank you for your efforts to spread the word about Open IGC and for helping our customers make even greater progress towards their governance objectives.

To our customers…I invite you to visit these partners’ web pages, ask them about how they can assist you with Open IGC and IGC issues in general, and challenge them to further expand their offerings to extend the repository for all your governance needs.

To our future partners…if you have built or are building a creative solution for achieving governance with the Information Governance Catalog, reach out to me or my IBM teammates around the world so that we can introduce your efforts to the overall IGC community and ensure your listing is on this page.

Thank you!      –ernie

 

Compact Solutions  http://www.compactbi.com/solutions/data-lineage/

Lucid  http://www.lucidtechsol.com/for-banner-3/

Manta  https://mantatools.com

Prolifics  http://www.prolifics.com/solutions/information-management-analytics

 

 

 

 

 


Apache Atlas: “your first look!”


Hi Everyone.

Just finished uploading the initial video in a series of recordings concerning Apache Atlas, the evolving open source initiative for metadata management and governance in hadoop.

This recording is primarily designed for viewers who aren’t comfortable doing their own builds of open source solutions and also need some guidance on how to get started with the VMware images that are available for download. It introduces the concept and helps validate what needs to be done so that the viewer can be successful with available Apache Atlas resources on the web. It starts with the download of existing images at the Hortonworks web site, and helps validate your environment so that you can continue with tutorials that are on the Hortonworks site, and/or start playing and exploring on your own. This is the first in a series of recordings on Apache Atlas that share early experiences and discoveries regarding this important open source initiative for governance and metadata management in hadoop.

Recording can be found at:  https://youtu.be/C4lf_EFduqU


Check out this “Recipe” for integrating Oracle ODI metadata into IGC!


Hi Everyone…

An IBM colleague has published an excellent use case on constructing an OpenIGC bundle  and publishing metadata and lineage for ETL processes represented by Oracle ODI.  She very nicely shows how to illustrate important structures and properties of a 3rd party ETL tool.   Ultimately, this leads to publishing of actual metadata instances so that IGC users can perform lineage reports and also “govern” (assign Terms, Stewards, etc.) their critical metadata.

Enjoy!

-ernie

https://developer.ibm.com/recipes/tutorials/creation-of-new-bundle-on-infosphere-information-governance-catalog/


Apache Atlas: GET-ting familiar with the REST API


Hi everyone.  Just posted the second in a series of recordings related to Apache Atlas, the Open Source initiative for metadata management and governance for hadoop.  Many of you have been asking about how to get metadata “out” of Apache Atlas so that you can load it into IGC or other repositories, or just use it for special governance reporting purposes.   In this recording we take a quick look at some of the key “GET” functions of the Apache Atlas REST API, and how you can easily do testing and prototyping of these calls using only your browser.   –ernie

https://youtu.be/6Us2zG-WvS8
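For anyone who wants to prototype outside the browser, here is a minimal sketch that only builds the GET urls, which you can then paste into a browser or hand to any HTTP client. The host, the port 21000, the /api/atlas prefix, and the "types" and "entities" resources are assumptions based on early Atlas releases; check them against the REST documentation for the version you are running.

```python
from urllib.parse import quote, urlencode


def atlas_url(base, resource, **params):
    """Build a GET url for an Atlas REST resource (the path prefix is an assumption)."""
    url = f"{base.rstrip('/')}/api/atlas/{quote(resource)}"
    return f"{url}?{urlencode(params)}" if params else url


base = "http://localhost:21000"  # hypothetical host and port

types_url = atlas_url(base, "types")
tables_url = atlas_url(base, "entities", type="hive_table")
print(types_url)
print(tables_url)
```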

 


Tech Talk on OpenIGC !


Hi all…  wanted everyone to hear about the upcoming “Tech Talk” that is scheduled for next week.   Marc Haber, Offering Manager for our metadata offerings, will be presenting, while others and I will be monitoring the chat room for questions and discussion.

Here are the details:

Event Name : Information Governance Catalog
Event Date : Wednesday, Sept 14
Event Time :  1 PM – 2 PM US (EDT) Eastern Daylight Time
Presented by : Marc Haber, Offering Manager
This presentation will provide a comprehensive overview of the ability to extend the Information Governance Catalog and support governance across new and alternate Data Sources or Systems. Understand how customers satisfy their requirements for a comprehensive Governance implementation or metadata management system with Information Governance Catalog. We will explore the process for defining and structuring new Asset Types and publishing information specific to Assets. Lastly, we will explore the process to govern such Assets, lending meaning through Glossary Terms, documenting requirements through Governance Rules and mapping information to support Data Lineage and Compliance Reporting. This topic will be presented by Marc Haber, Offering Manager for Information Governance Catalog and Data Governance in general across Information Server. Marc has extensive experience with Business Glossary, Metadata Workbench and Governance Catalog, helping customers implement governance initiatives or satisfy metadata management requirements.

Registration –
https://www.eventbrite.com/e/ibm-tech-talk-is-open-igc-tickets-27329302680

Password:  Governance



Tech Talk on Information Analyzer: Virtual Tables


Hi all.

Just wanted to pass along news of another Tech Talk.  This one on Information Analyzer and Virtual Tables.   Here are the details and the link to Eventbrite to register…

October 20, 2016
Time: 9:00 EST
Topic:  Virtual Table Feature in Information Analyzer

This presentation will provide a comprehensive overview of the Virtual Table feature in Information Analyzer. A Virtual Table is essentially a way to filter or limit the data from the source repository while performing IA analyses such as Column Analysis, Key Analysis, and Data Rules Analysis. The concept of a Virtual Table has been available in the IA workbench since version 8.1, with quite a few limitations. A new type of Virtual Table called ‘SQL Virtual Table’, introduced in 11.5, eliminates all the limitations and allows users to define any complex SQL queries to filter the data during IA analysis. It also allows users to query exceptions directly from the source repository with known queries. At this moment, a SQL Virtual Table can only be defined using the IA REST API / CLI. In this session, we will also see a demonstration of this feature.

Who should attend this session? – For all skill levels of current and prospective Information Analyzer users from both IT and line of business.

This topic will be presented by Suresh Tirumalasetti, Software Developer, Information Analyzer.  Suresh has extensive experience in software development and customer support, especially for Information Analyzer.  Suresh is located in Bangalore, India.

To attend you must register here:

https://www.eventbrite.com/e/virtual-table-feature-in-information-analyzer-tickets-28227856278

Password: Governance


Accessing IGC via cURL

OpenIGC Accelerator


Hi Everyone…

Happy Spring! [for those of you in the northern hemisphere  ; )  ].   Great time to start “cleaning out” and “fixing up” things….whether around the house, or in the corners of our special projects.    In that latter category, I have “tidied up” a little utility I have been working on to assist everyone in building their OpenIGC prototypes or to assist in “getting to know” OpenIGC — a “form builder” for the “Publishing XML” needed to realize instances of your newly modeled and registered OpenIGC artifacts.

A lot of you have expressed the desire to get deeper into OpenIGC, but have found it difficult to get your arms around the xml aspects of it.  Either that, or cutting and pasting xml in a text editor is just not your thing.   For those reasons and others, I have been exploring various ways that a user interface could be created for OpenIGC assets — without resorting to an elegant albeit complex and lengthy GUI development effort.

Digging around, I found some open source javascript tooling to assist, and brushed off enough javascript and html skills to put it together.     At the url listed below you will find a tool that allows you to upload your bundle descriptor and generate a self-populating “form” to construct a publishing xml document for OpenIGC.   It also provides options to save the publishing xml to disk (for future use/editing) or to directly cut and paste into the igc-rest-explorer page.

It’s not “perfect” (I suspect it probably has its share of anomalies if you click on things out of order), but is hopefully a “helper” that will accelerate your efforts to implement custom assets for governance within IGC.

Please carefully READ the instructions (there is a link to instructions and a simple screen shot on the initial page).    The tool does not entirely “hide” your xml, and it REQUIRES that you understand your bundle (if you don’t know what I am talking about regarding OpenIGC and bundles, please review the blog series starting with https://dsrealtime.wordpress.com/2015/07/29/open-igc-is-here/ )! ….still, it does a few nice things for you:

  • Performs all the xml tagging/formatting, ensuring that your xml remains “well-formed”
  • Presents a “pull-down” select list for your classNames and attribute enumerations
  • Generates the list of attributes (properties) for whatever class you select
  • Automatically generates the unique “assetIDs” for the asset instances that you define
  • Generates and presents a pull-down list for selecting “parent” assetIDs
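To give a feel for what the tool automates, here is a minimal sketch of the same ideas: tagging handled by an XML library so the document stays well-formed, automatically generated asset IDs, and a parent reference by ID. The element and attribute names (doc, assets, asset, attribute, reference, repr, assetIDs) and the class names are assumptions for illustration only; check them against the formal Open IGC documentation and the sample bundles before using anything like this against a real system.

```python
import itertools
import xml.etree.ElementTree as ET


class PublishingDoc:
    """Builds a publishing-style document (element names are assumptions)."""

    def __init__(self):
        self.root = ET.Element("doc")
        self.assets = ET.SubElement(self.root, "assets")
        self._ids = itertools.count(1)

    def add_asset(self, class_name, name, parent=None, **attributes):
        """Add one asset instance and return its generated ID."""
        asset_id = f"a{next(self._ids)}"
        asset = ET.SubElement(
            self.assets, "asset", {"class": class_name, "repr": name, "ID": asset_id}
        )
        ET.SubElement(asset, "attribute", {"name": "name", "value": name})
        for key, value in attributes.items():
            ET.SubElement(asset, "attribute", {"name": key, "value": str(value)})
        if parent is not None:
            reference_name, parent_id = parent
            ET.SubElement(asset, "reference", {"name": reference_name, "assetIDs": parent_id})
        return asset_id

    def to_xml(self):
        # ElementTree escapes special characters, so the output stays well-formed.
        return ET.tostring(self.root, encoding="unicode")


# Hypothetical classes from a messaging-style bundle.
doc = PublishingDoc()
manager_id = doc.add_asset("$Msg-QueueManager", "QM1")
queue_id = doc.add_asset(
    "$Msg-Queue", "orders & returns", parent=("$QueueManager", manager_id), depth=100
)
xml_text = doc.to_xml()
print(xml_text)
```

Note how the ampersand in the queue name is escaped for you; that is the kind of detail that is easy to miss when cutting and pasting xml by hand.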

As noted above, I can’t promise that it is entirely bug-free, but I can say that it has already helped me accelerate the prototyping of several bundles that I have been building recently to illustrate the power of OpenIGC for extending the repository.    Have fun, good luck, and please let me know how you make out in using this tool!       –ernie

http://www.openigcaccelerator.com

 


Re-defining Data Lineage

IBM and Hortonworks!


Hi everyone…

Some exciting recent news, if you haven’t seen it yet…announced a few days ago at the DataWorks Summit/Hadoop Summit in San Jose, a new relationship between IBM and Hortonworks!   Read about it here to learn how IBM and Hortonworks are partnering to further the efforts of our customers to expand their big data solutions.

http://www-03.ibm.com/press/us/en/pressrelease/52572.wss?platform=hootsuite

More important for this blogger is the increased attention this brings to Apache Atlas.  Apache Atlas, if you aren’t already familiar, is an evolving open source approach to enterprise information governance, metadata management, and lineage […go here for a general overview:  https://hortonworks.com/apache/atlas/ ].   One highlight from the news above draws particular attention to the contributions IBM and Hortonworks are making to this effort:

“Partnering On Apache

As part of their wide-ranging partnership, the companies will also team to advance the development of Unified Governance (IBM BigIntegrate, IBM BigQuality and IBM Information Governance Catalog) on the Apache Atlas open platform. …”

It’s all a work-in-progress, but this is significant news that will hopefully accelerate the initiative.   Have any of you started working heavily with Atlas?   Which release?  Are you using it exclusively with Hadoop, or externally?   Have you interchanged metadata with Atlas and IGC?  Considering it?    Share your experiences!

Ernie

Related posts:

Evolving Atlas…

 

 

 

