Tracing Enterprise Data Footsteps! …celebrating the journey of data!

Reviewing the Advanced Tab in the Metadata Workbench


Hi all…

Just thought I’d throw in a quick review of the important (imho) links at the Advanced tab…..some of these factoids are buried in my other posts, but I needed to have a cheat-sheet for myself and others. Here it is:

Automated Services. This option brings up the dialog that runs the parsing or “stitching” process for the detailed metadata you have in your DataStage Jobs or Connector-imported rdbms views. It does a lot of stuff, takes time the first time you run it (if you have a ton of metadata), and should be scheduled during off-hours. After the first run against a particular project, it uses a change recognition mechanism to only pick up Jobs that have been updated. Note the “checked” DS Projects carefully. Only select those that are really critical, and once checked, don’t “uncheck” — as you will see from the warnings, this will “remove” all parsing history. Ultimately, this step is the one that reviews the Jobs, connects them via common information found in Stages, etc. See my other posts for how the connection of Jobs to each other is determined.

Stage Binding. When all else fails, you can connect two stages to each other. Use this when, for some reason, two Jobs won’t connect, or when the rules for connecting them can’t be met. I’ve needed this with some custom Stage or Operator implementations, and when I am using a technique that prevents automatic connection. Imagine having a Sequential Stage at the end of a Job that is writing out some xml content — and then I’m using the XML Stage in the next Job to read that content. There isn’t much in common between those Jobs, but I still want lineage to run directly thru them…

Data Item Binding. This provides a “manual” binding of particular Stages to Database Tables and Data Files (see other posts for what those are, how they are created, and how they are different from “DataStage Table Definitions”). Use this when you are unable to get Database Alias to work as you expect and you simply want to “bolt” a particular Database Table or Data File to a Stage in one of your Jobs to complete the lineage picture.

Data Source Identity. Use this when, for whatever reason, you want to link two identical tables for lineage purposes. Reasons? Two people might have imported the same metadata accidentally and you don’t want to delete it….or you might have the “design” information from an ERwin model and also have the “actual” table information from the rdbms catalog. There are many valid reasons. This link lets you relate tables together. They must have the same name — the option here lets you relate the “Schemas” of two different databases. Identical tables within those schemas will become linked for lineage reporting — and therefore, also linked to whatever those individual tables connect to for lineage.

Database Alias. This option establishes the connection between an abstract string in a DataStage Stage (Server name, DSN name, etc., as defined by the relational stage) and the “Host/Database” combination that was actually imported. Database Tables in Metadata Workbench are typically “actual” tables — but in DataStage, like any well designed application, the “name” is a placeholder. This assigns the “placeholder” to the host and database. The schema.tablename used in the Stage will then be matched against the Host/Database set of Tables to create a lineage connection. The list presented at this option will be entirely empty until you perform Automated Services. Then it will be populated with each StageType and “server string” combination that it finds in your Jobs.

Hope this helps understand these options.

Ernie



New developerWorks article on DataStage and new XML Stage!

Actional Diagnostics…great to use an “old friend” again…


Hi All…

Just a quick note. I had the pleasure today of finally getting around to installing Actional Diagnostics. This is the latest release of what used to be called “SOAPScope”…it’s been awhile since I’ve had a need for it…but it was perfect, still providing all the great things that it did in its earlier implementations.

This is something I’ve been meaning to do for a long time, but haven’t had the chance. Mindreef was acquired by Progress Software several years ago, and what was originally “SOAPScope” has been rebranded. I don’t have any specifics as to the other Actional offerings that are connected, but I can say that the experience was excellent. The download and install went very smoothly, and I was invoking a service within minutes. The screens appear to be the same, although they have probably included new functionality that I have yet to explore.

If you need an easy to use testing tool for your services, put this one on your list. There are many of them out there….all good tools….. I happen to have always liked this one because it offers a good compromise — it will appeal to users like myself who are comfortable with xml and http protocols, yet also be easily adopted by users who don’t want to be exposed to xml and simply want an easy-to-use GUI. Especially nice in Actional Diagnostics is the ability to perform load testing, where you can easily create multiple threads (thus simulating multiple users) invoking your service in concurrent fashion.

Bravo to the team who is still supporting this.

You can find the download details at http://web.progress.com/en/actional/

Ernie


Check out this “can’t miss” recording of metadata mgmt and governance in action!


One of our IBM partners, Compact (www.compactbi.eu and www.compactbi.com), has published a video recording on their web site that illustrates their MetaDex solution. MetaDex, in combination with Information Server Business Glossary and Metadata Workbench, enables metadata management for additional technologies that are outside of Information Server. Parsing for independent ETL tools and complex SQL scripting are just part of what MetaDex offers.

This is an excellent video that is worth eight minutes of your time. It highlights the functionality of Business Glossary and Metadata Workbench and how they work together along with MetaDex to provide a strong solution in support of your governance objectives. Enjoy!

Ernie

The recording is specifically at: http://compactbi.eu/solutions/metadex


New YouTube Channel created for Information Server content…


Hi Everyone…

Check out this new YouTube channel…its first entry is a recording I put together to illustrate how to publish a DataStage or QualityStage Job as a Service…

http://www.youtube.com/channel/UCVFAoFT_zaVF_JWHGz-8d5w?feature=guide

My colleagues in product management and marketing are managing the channel and encouraging myself and others to put together all kinds of videos….demos of new and exciting features, or recordings that illustrate “how to” do something. I hope it is a resource that we all find useful going forward.

Ernie


New RedBook for XML Stage is available!


The new redbook is available for the enhanced XML capabilities introduced by the “XML Stage” in Release 8.5 in October of 2010. It represents a lot of hard work by my colleagues who worked with, developed, and tested this enhanced way of processing XML content in an ETL tool. Congrats to the entire authoring team, the reviewers, and the people who made publication of the Redbook possible — and congrats to the rest of us who now have another excellent resource for reading and writing complex XML using DataStage, QualityStage, and Information Server!

You will find this new redbook here:

http://www.redbooks.ibm.com/abstracts/sg247987.html?Open

Ernie


Lineage for RDBMS “Views”


Hi Everyone…

Someone asked me yesterday about being able to perform lineage in Metadata Workbench on a database “View”. It dawned on me that I may have never created a post on this very important subject. Formal Database views, as created in the rdbms catalog via a CREATE VIEW sql statement, are fully supported for Data Lineage purposes by Information Server and the Metadata Workbench.

The key is in the method of import.

When you import your rdbms catalog information via Connector (in 8.7 Metadata Asset Manager, or in any 8.x release via DataStage, Information Analyzer, or FastTrack), the views are imported and get their own icon for display purposes within the “Hosts” tree (or Implemented Data Resources if you are in 8.7). The details of the view are available for display, and will show the SQL used in the CREATE VIEW statement to originally create it.

More importantly, when you perform “Automated Services” (a.k.a. “stitching”, or in 8.7, “Detect Associations”), Metadata Workbench will parse thru the SQL of the CREATE VIEW and establish data lineage connections to the “source tables” of the view! Once this is done, you will have lineage for the view back to its source tables, and of course, anything that is upstream from those tables, or downstream from the view itself!

Ernie


Are Your Kids Addicted to Minecraft?


Are your kids playing Minecraft yet? ….or should I say “Are your kids addicted to Minecraft?” This game by Mojang (www.minecraft.net) is a creative game where you put your imagination and creativity to work to build your own “world” using square blocks that you construct or dig and destroy to make tunnels, walls, caves, houses, mountains and more. The square blocks themselves represent many elements (stone, iron, coal, etc.) and in combination, lead you to the discovery or building of new things that you need to expand your empire. You find blocks of iron ore as you are “mining” (basically blowing up blocks of earth as you create new paths) and in combination with coal that makes a stove, you can obtain iron and gold utensils. Of course, to keep things interesting there are monsters and spiders that can kill you as the game turns from day to night, so you need to construct various shelters in order to survive.

I may have some of the facts incorrect — I have only watched the game, not played it, but am amazed by the phenomenon and how “catching” it has been for 10 to 13 year olds around the world. Nearly every parent I speak to has to remind their children to put down their iPad or iTouch or iPhone when they’ve played Minecraft for too many hours, or to turn off their Xbox or other computer that is running other platform versions of the game!

What makes this game so addictive? My only reference is our own creation of mini-worlds when we were kids. What kept us outside for hours with our toys, designing private and public spaces and sharing them with our friends? For me it was Matchbox cars and trucks and Hot Wheels, with an occasional GI Joe, platoon of plastic soldiers, or Tonka truck mixed in. My best friend and I would build elaborate “villages” in and around the garden, with pachysandra posing as giant palm trees. We used sticks and pebbles to mark parking lots, driveways, and highways, and a small patch of sand at the other end of the garden served as our “remote gravel pit.” This was our own “mining” operation where our special mission trucks would go on “expeditions”. Mojang has clearly tapped into that same experience, only now it is on a touch screen, able to be played anywhere and at any time, and shared over the web between children who teach each other new things and proudly show off their new designs. Video games exist with far more spectacular graphics and intricate plot lines, but bravo to Mojang for delivering a platform that inspires the imagination with basic simplicity while allowing for an infinite array of unique and challenging experiences.

Why write about this on “this” blog?

Watch some youngsters playing this game for awhile. It won’t be long before you are amazed at the speed at which they build/destroy/re-build/tear down and continue to evolve their world, all the while looking out for dangerous spiders or avoiding “Creepers” that will blow up and kill their game character if they step too near. Players make quick decisions about where to dig, what to build, or whether to leave a cave without knowing if it is nighttime (players learn early that monsters come out at night). How do they achieve this speed? Practice of course, but also through collaboration with their peers. They can play on the same network and play within each other’s worlds. They learn a whole new vocabulary and continually learn from others where to go (within their own groups or on the web). Some will accuse me of making a leap here, but purely for fun, this is governance in action. New terminology is shared by everyone in the Minecraft community (do you know what “Creepers” are, or how to get Glowstone and Blaze Rods out of “The Nether”?), helped along by Stewards (check out YouTube — there are hundreds of tutorials and videos out there from experienced “guides”) and metadata galore as players manage “Chests” full of artifacts collected and made, along with accurate counts of their inventory. Lineage is a bit of a stretch and not a concept you can directly apply to Minecraft, but there is a cottage industry for recording software that will let you create videos of a trip through your world or a fight with monsters.

If you are still reading this blog entry and don’t know anything about metadata, I hope you enjoy watching or playing Minecraft with your kids, providing it doesn’t push every other important activity out of the way! …and if you are into governance, I hope you had fun with the analogy and enjoyed a brief respite from metadata and governance in our technical realm. :)



Best Practices and Techniques for the “New” XML Stage


Hi Everyone…

It’s been awhile since I’ve posted anything. A certain amount of “blog” fatigue is partly to blame, but it’s also because I like to post things that are (as much as possible) proven and time-honored (and not release dependent). Many/most of the techniques I write about here are ones that I’ve spent many hours helping customers and colleagues implement in real situations.

It’s time I write about the XML Stage.  It is not-so-new anymore, but still feels new as it has had some very important xsd handling features added to it in the last few releases.   This week I will start posting suggestions and tips for using the XML Stage to read and write xml documents using DataStage.  

I’ll start with a pointer to a valuable RedBook that came out last year regarding the XML Stage.  I had the pleasure of reviewing the material as the authors put it together.  It is a great place to start when learning to work with this important Information Server capability.

XML Stage Redbook

Ernie

…and a link to the first “New” XML Stage post…

Establish Meaningful Link Names when using the XML Stage!


XML Stage: Establish Meaningful Link Names


…and then stick with them! Decide early what you want your Link names to be, before you even open up the Stage and begin your work on the Assembly, and then lock them in. Make a conscious decision not to change or alter them. Why? Unlike most other Stages and Connectors on the DataStage canvas, the XML Stage is not immune to Link name changes.

How many of you are perfectly happy with DSLink2 and DSLink35 or other automatically generated Link names? I know I don’t spend time on every Job, running around putting on fancy Link names, especially when I’m first building it. It’s nice for documentation, and I know that I should always create meaningful names, but how many of us do?

And how often do we “go back” and edit the Link names later? That’s actually a good thing — for most Stages and Connectors. But for the XML Stage, it is something you want to avoid. Changing Link names will break your Assembly and require that you edit the stage and make changes.

Here is an example of the XML Stage reading xml documents from a subdirectory and performing validation. Valid xml will be sent down the “goodXML” Link, and rejected, invalid xml content will be sent down the “badXML” link.

[Screenshot: Job canvas showing the goodXML and badXML link names]

Notice how, inside the Assembly, these link names are used. Here in the Assembly Parser step, you see the toXML linkname used for the specification of the xml Source:

[Screenshot: Assembly Parser Step using the link name as the xml Source]

…and here, in the Assembly Output Step, you can see how the Link names are used in the Mapping:

[Screenshot: Assembly Output Step Mapping using the link names]

Those screen shots illustrate how the link name becomes critical to the internals of the Assembly. If you change the link names outside the Stage, the Assembly will end up with errors (various red marks throughout the Assembly, depending on how complex it is):

[Screenshot: Assembly showing red error marks after a Link name change]

Are you able to correct the Assembly when this happens? Of course…and for most scenarios, it’s not difficult…you might just need to change a setting or re-map a couple of columns. But save yourself the trouble. Decide on your Link names, set them up early (preferably before you ever enter the Stage) and then don’t touch them!

—Ernie


Always use an Input Link AND an Output Link with the XML Stage


Another way to say this is to “avoid using the XML Stage to perform i/o.” The XML Stage is capable of reading an xml document directly (a feature in the Parser Step) and is also able to write to a new document on disk in the Composer Step. However, while it may seem simpler to do that initially, it makes your Jobs and Stage designs less flexible and less re-usable. You should have an Input link that feeds XML to the XML Stage when you are “reading” or parsing xml (and of course you will have output links that send the results downstream), and you should have an Output link that sends your completed XML document(s) downstream when you are writing XML (and of course you will have input links that feed in the source data).

Let’s see why.

When you are first learning the XML Stage, it seems convenient to just “put in the name of the xml document” and keep going. The Parser Step allows you to specify the filename directly (or it can be parameterized), and then you continue with the assignment of the Document Root. Similarly, when creating a new XML document, the Composer Step allows you specify the actual document to be written to disk.

Then someone comes along and says “Our application is changing. The xml documents we currently read from disk will now be coming from MQ Series…” …or maybe “…from a relational table” …or “from hadoop”…. Well, you can’t just “change the Stage type at the end of the link” in that case. You have to “add” the link, and then make what could potentially be extensive changes to your Assembly. While not especially difficult once you are familiar with the Stage, if you have moved on to other projects, or have been promoted and are no longer supporting the Job, a less experienced DataStage developer will be challenged.

So…when using the Parser Step, use one of the options that describes your incoming content as either coming in directly as content (from a column in an upstream Stage), or as a set of filenames (best use case when reading xml documents from disk, especially when you have a whole lot of them in a single sub-directory [see also Reading XML Content as a Source ] )

[Screenshot: Parser Step options for specifying the incoming XML source]

The same thing is true for writing XML. Send your xml content downstream — whether you write it to a sequential file, or to DB2, or to MQ Series or some other target, the logic and coding of your XML Stage remains the same! In the Composer Step, choose the “Pass as String” option and then in the Output Step, map the “composer result” to a single large column (I like to call mine “xmlContent”) that has a longvarchar datatype and some arbitrary long length like 99999. While there may be times when this can’t be easily done, or when you need to use the option for long binary strings (Pass as Large Object), for many/most use cases, this will work great.

[Screenshot: Composer Step options, including “Pass as String”]

Get in the habit of always using Input and Output Links with the XML Stage. Your future maintenance and changes/adaptations will be cleaner, and you can take better advantage of features such as Shared Containers for your xml transformation logic.

Ernie


Creating Data File objects from inside of DataStage


A seldom used object in Metadata Workbench is a “Data File”. It is not as common because it has to be manually created. Database Tables are created whenever you use a Connector or other bridge to import relational tables from a database. Data Files, however, can only be created manually, using the istool workbench generate feature, or from inside of the DataStage/QualityStage Designer.

Why create Data Files?

A Data File is the object available in the Metadata Workbench that represents flat files, .csv files or DataSets. It is able to connect to the Sequential Stage or Dataset Stage for data lineage purposes. A Data File object might also be used for pure governance reasons — a special transaction file might be defined by a particular Business Term, or you might want to assign a Steward to the Data File object — the subject matter expert on one particular file. Of course, if you are a DataStage user, you probably use regular Sequential Table Definitions all the time. Data Files are similar but are more “fixed” — they are designed to represent a specific flat file, on a given machine, and in a particular sub-directory, as opposed to being a general metadata mapping with proper column offsets for any file that matches the selected schema.

The simplest way to create a formal Data File is to start with a DataStage Table Definition. You may already have one that was created when you imported a sequential file, or can easily create one using the “Save” button on any column list within most Stages. Once you have the Table Definition, double click on it. Review all of the tabs across the top. Pay special attention to the “Locator” Tab. Click on that one. Look at its detail properties. Values at the Locator tab control the creation of Data Files or Database Tables.

Set the pull-down option at the top to “Sequential”. If that value is not already in your pull-down list, type it in… Towards the bottom you will see an entry for the Data Collection — put in the name you want for your file. Close the Table Definition.

Now put your cursor on that Table Definition in the “tree”. Right mouse and select “Shared Table Creation Wizard”. When that dialog opens, click Next. Then open the pull-down dialog and select “create new”, and click Next. Notice the properties at this new page….you have the Filename, the Host (pick a machine or enter a new one) and Path. Make the filename the SAME as what you have hard coded in your Sequential or Dataset Stage, or the filename of any fully expanded Job Parameter default values that you are passing into it. Then set the “Path” value to the fully qualified path, again using either the expanded Job Parameter values or whatever is hard coded in that same filename property. For example, if your filename in the Stage looks like this:

/tmp/myfile/#myfilename# …and #myfilename# has a default value of mySequentialFile.txt

Then use mySequentialFile.txt as the Filename and /tmp/myfile (without the final slash) for the path. Now you will have a Data File inside of Metadata Workbench that you can govern with Steward and Term assignments, and it also will stitch to the Stages that use its name in hard coded fashion or expanded Job Parameters for Design time or Operational lineage.

Ernie


Business Glossary and Cognos — Integrated together…


Hi Everyone…

Just wanted to share a video I completed today that illustrates the integration of Cognos reporting with InfoSphere Business Glossary…. showing a user inside of Cognos, using the right-click integration that Cognos provides to do a context-based search into Business Glossary to display a term, and then navigate further through the metadata to find details about a value and concept in the report.

This is very much like the Business Glossary Anywhere, except that it is a capability built directly into the Cognos Report Studio and Cognos Report Viewer. Enjoy!

Ernie


Building Metadata Extensions for Information Server: Why?


Lately I have been working with a lot of sites who are interested in “Extensions”. Extensions are simple ways of defining new objects within Information Server, and/or tying them together for data lineage purposes.

Extensions come in two different flavors. There are Extended Data Sources, which are the equivalent of defining your own tables, columns, files, or other “things” that you want to appear as individual “icons” in your lineage diagrams. The others are called Extension Mapping Documents, which are the specifications that define sources and targets (along with other useful metadata properties) and describe the “lineage” that will be drawn by the Metadata Workbench when performing any type of lineage reporting.

Why create them? Doesn’t Information Server allow imports of tables, columns and files, and other artifacts in our environments? Doesn’t DataStage provide me with data lineage, describing complex flows of data?

The answer to that question depends largely on what you are trying to accomplish with your Information Governance objectives. If you are only narrowly concerned about the DataStage Jobs in your application, and the datamarts that they flow to, there may not be a need for Extensions. However, many of you are expanding your horizons beyond just DataStage, and looking at all of the other elements of your enterprise that need tracking, management, oversight, and governance. Such sites are looking to include in their lineage ALL of their objects — not just the tables and columns defined in their relational databases, but also the legacy objects, the message queues, the green screens, the CICS transactions or even the illustration of “people”, so that Tweets and other social media feeds can be shown as the “source” in a lineage diagram that ends up in Hadoop! Those same sites also need to outline the processes that move and transform data, whether they are DataStage, another ETL tool, shells, FTP scripts, java or other 3gl programs.

Every one of those objects may be important to lineage, especially when there is a need to provide detailed source information to upper management. Equally, those objects also demand governance — such as being assigned Stewards, becoming associated with business concepts and Terms, or shown as “Implementing” a particular data quality “Policy” or “Rule”. Further, such objects benefit by being categorized, labeled, or otherwise organized into Collections that make them more useful to everyone who is in need of further definition and deeper understanding. Anyone who “touches” a piece of data, whether it is for development, evaluating a report, or making a crucial decision will benefit by the addition of Extensions.

Several years ago I talked about Extensions as a way of defining an external Web Service (Data Lineage and Web Services). This is just one example of a flow, outside of normal ETL, that has value in being tracked and managed. I have worked with many customers who have defined other ETL tools for lineage, with or without DataStage. Always the goal is to provide more insight to decision makers who need to know where things come from, how they were calculated, who the experts are (and more).

Building Extensions requires first thinking far outside the box — and looking at “all” the metadata that is important to your data integration efforts. What is the metadata that will be meaningful to those business users? Certainly also, there is the need for impact analysis and providing value to your developers who want to answer questions such as “Which processes use this table?” or “Which processes will be affected if we make changes to this MQ Series queue definition?”

These are some of the key reasons “why” people are creating Extensions. There is a lot of “built-in” metadata that exists within Information Server. However, you can extract even MORE value from your Information Server investment by adding new objects and new capabilities to the collection of metadata that you are already successfully managing.

Next post will suggest ways to decide which extensions you need, and then we’ll dive into how to create them and what you should consider…

Ernie

Next post in this series….Methodology for Building Extensions


Methodology for Building Extensions


Hi Everyone…

In the last post I talked about “why” Metadata Extensions are useful and important (Building Metadata Extensions….”Why?”). Today I want to review the basic steps in a high level “methodology” that you can apply when making decisions about extensions and their construction that will help you meet the objectives of your governance initiatives.

Step 1. Decide what you want to “see” in lineage and/or accomplish from a governance perspective. Do you have custom Excel Spreadsheets that you would like to illustrate as little “pie charts” in a lineage diagram, as the ultimate targets? Do you have mainframe legacy “green screens” that business users would like to see the names of as “icons” in a business lineage report? Are there home grown ETL processes that you need to identify, at least by “name”, when they move data between a flat file and your operational data store? Lineage helps boost confidence for users, whether they are report and ETL developers, or DBAs tracking actual processes, or reporting users who need some level of validation of a data source. Which objects are “missing” from today’s lineage picture? Which ones would add clarity to the users’ “big picture”? Each of the use cases above represents a scenario where lineage from the “known” sources (such as DataStage) wasn’t enough. There are no industry “bridges” for custom SQL, personalized spreadsheets, or home grown javascript. And in the green screen case, the natural lineage that illustrated the fields from a COBOL FD left business users confused and in the dark.

The “…accomplish from a governance perspective” in the first sentence above takes this idea further. The value of your solution is not just lineage — it will be valuable to assign Stewards or “owners” to those custom reports, or expiration dates to the green screens. Perhaps those resources are also influenced by formal Information Governance Rules or Terms in the business glossary. The need to manage those resources, beyond their function in lineage, is also something to measure.

Step 2. How will you model it inside of Information Server? Once you know which objects and “things” you want to manage or include in lineage, what objects should you use inside of Information Server? The answer to this is a bit trickier. It requires that you have some knowledge of Information Server and its metadata artifacts, how they are displayed, which ones exist in a parent-child hierarchy (if that is desirable), which ones are dependent upon others, what does their icon look like in data lineage reports, etc. There aren’t any “wrong” answers here, although some methods will have advantages over others. There are many kinds of relationships within Information Server’s metadata, and nearly anything can be illustrated. Generally speaking, if the “thing” you are representing is closest in concept to a “table” or a “file”, then use those formal objects (Database Tables and Data Files). If it is conceptual, consider a formal logical modeling object. If it looks and tastes like a report, then a BI object (pie chart) might be preferred. If it is something entirely odd or abstract (the green screen above, or perhaps a proprietary message queue), then consider an Extended Data Source. I’ll go into more details on each of these things in later posts, but for now, from a methodology perspective, consider this your planning step. It often requires some experimentation to determine how best to illustrate your desired “thing”.

Step 3. How much detail do you need? This question is a bit more difficult to answer, but consider the time-to-value needed for your governance solution, and what your ultimate objectives are. If you have a home grown ETL process, do you need to illustrate every single column mapping expression, and syntax? Or do you just need to be able to find “that” piece of code within a haystack of hundreds of other processes? Both are desirable, but of course, there is a cost attached to capturing explicit detail. More detail requires more mappings, and potentially more parsing (see further steps below). A case in point is a site that is looking at the lineage desired for tens of thousands of legacy cobol programs. They have the details in a spreadsheet that will provide significant lineage…..module name, source dataset and target dataset. Would they benefit by having individual MOVE statements illustrated in lineage and searchable in their governance archive? Perhaps, but if they can locate the exact module in a chain in several minutes — something that often takes hours or even days today — the detail of that code can easily be examined by pulling the source code from available libraries. Loading the spreadsheet into Information Server is child’s play — parsing the details of the COBOL code, while interesting and potentially useful, is a far larger endeavor. On a lesser note, “how much detail you need” is also answered by reviewing Information Server technology and determining things like “Will someone with a Basic BG User role be able to see this ‘thing’?”…which leads to “Do I want every user to see this ‘thing’?”. Also important is whether the metadata detail you are considering is surfaced directly in the detail of lineage, or if you have to drill down in order to view it. How important is that? It depends on your users, their experience with Information Server, how much training they will be getting, etc.

Step 4. Where can you get the metadata that you need? Is it available from another tool or process via extract? in xml? in .csv? Something else? Do you need to use a java or C++ API to get it? Do you have those skills? Will you obtain the information (descriptions, purposes) by interviewing the end users who built their own spreadsheets? Is it written in comments in Excel? Some of the metadata may be floating in the heads of your enterprise users and other employees. Structured interviews may be the best way to capture that metadata and expertise for the future. Other times it is in a popular tool that provides push-button exports, or that has an open-enough model to go directly after its metadata via SQL. ASCII based exports/extracts have proven to be one of the simplest methods. Governance teams are usually technical, but they often lack resources with lower level API skills. Character based exports, whether xml, or .csv or something else, are often readable by many ETL tools, popular character based languages like PERL or similar, or even manipulated by hand with an editor like NotePad. I use DataStage because it’s there, and I am comfortable with it — but the key is that you need to easily garner the metadata you decided you need in the previous steps.

Step 5. Start small! This could easily be one of the earlier steps — the message here is “don’t try to capture everything at once”. Start with a selected set of metadata, perhaps related to one report, or one project. Experiment with each of the steps here with that smaller subset — giving you the flexibility to change the approach, get the metadata from somewhere else, model it differently or change your level of detail as you trial the solution with a selected set of users. Consider the artifacts that will have the most impact, especially for your sponsors. This will immediately focus your attention on a smaller set of artifacts that need to be illustrated for lineage and governance, and allow you to more quickly show a return on the governance investment that you are making.

Step 6. Build it! [and they will come :) ] Start doing your parsing and construct Extensions per your earlier design. Extension Mapping Documents are simple .csv files…no need for java or .net or other type of API calls. Adding objects and connecting them for lineage is easy. Extended Data Sources, Data Files, Terms, BI objects — each are created using simple .csv files, and/or in the case of Terms, xml. I suggest that you do your initial prototypes entirely by hand. Learn how Extensions and other such objects are structured, imported, and stored. As noted earlier, I will go into each of these in more detail in future posts, but all of them are well documented and easily accessible via the Information Server user interfaces. Once you have crafted a few, test the objects for lineage. Assign Terms to them. Experiment with their organization and management. Assign Stewards, play with adding Notes. Work with Labels and Collections to experience the full breadth of governance features that Information Server offers. Then don’t wait — get this small number of objects into the hands of those users — all kinds of users. Have a “test group” that includes selected executives, business reporting users and decision makers in addition to your technical teams. Get their feedback and adjust the hand crafted Extensions as necessary. Then you can move on and investigate how you’d create those in automated fashion while also loading them via command line instead of via the user interfaces.
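To make the “simple .csv files” point concrete, here is a minimal Python sketch that turns the kind of module/source/target spreadsheet described in Step 3 into a candidate mapping file you could review before import. The column headings, file names, and dataset names below are illustrative assumptions only — check the Extension Mapping Document format in the documentation for your release for the exact headers it expects.

```python
# A minimal sketch: generate a candidate extension mapping CSV from a
# spreadsheet-style extract of legacy COBOL modules (as in Step 3).
# NOTE: the header names and sample values below are illustrative, not
# the exact layout required by your release -- confirm them against the
# product documentation before importing.
import csv

# Hypothetical rows: (module name, source dataset, target dataset)
modules = [
    ("PAYROLL01", "PROD.HR.EMP.MASTER", "PROD.HR.PAY.EXTRACT"),
    ("PAYROLL02", "PROD.HR.PAY.EXTRACT", "PROD.GL.POSTING"),
]

with open("cobol_mappings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Rule", "Description"])  # illustrative headers
    for module, source, target in modules:
        writer.writerow([source, target, module,
                         f"Movement performed by COBOL module {module}"])
```

Starting with a small hand-built file like this (per the “start small” advice above) makes it easy to verify the lineage you get before investing in a more automated parser.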

Keep track of your time while doing these things so that you can measure the effectiveness of the solution vis-a-vis the effort that is required. For some of your extensions, you may decide that you only need a limited number of objects, and that they almost never change — and no future automation will be necessary. For others, you may decide that it is worth the time to invest in your own enterprise’s development of a more robust parser or extract-and-create-extension-mechanism that can be implemented as metadata stores change over time. This also makes it simpler to determine when it makes sense to invest with an IBM partner solution for existing “metadata converters” that already load the repository. These are trusted partners who work closely with all of us at IBM to build solutions that have largely answered the methodology questions above in their work at other locations. IBM Lab Services also can help you build such interfaces. When appropriate and market forces prevail, IBM evaluates such interfaces for regular inclusion in our offerings.

Ultimately, this methodology provides you with a road map towards enhancing your governance solution and meeting your short and longer term objectives for better decision making and streamlined operations via Information Governance.

-Ernie



DataStage and Minecraft (just for fun…)!


Hi Everyone!

Do your kids play Minecraft? Do you? Here is a “just for fun” recording of DataStage in a Minecraft world….. if you use DataStage and you and/or your family members play Minecraft, we hope you’ll enjoy this little adventure into the “world of Transformation”….. ; )

http://youtu.be/YFbLxbPuScA

Ernie


New Recording on DataStage/QualityStage Lineage!


Hi Everyone…

Our engineering team just posted a very nice, short recording on how easily you can view Data Lineage for your existing DataStage and QualityStage Jobs after simply importing them into 11.3 !

https://www.youtube.com/watch?v=zBHdC0lxLDc

The Jobs do NOT have to be “running” in 11.3. They can continue to run in their current production environment while you take advantage of all the new metadata features in 11.3. You import the Jobs and can also import the Operational Metadata from your earlier releases.

Ernie


Open IGC is here!


Hi Everyone….

Been awhile since I’ve posted anything — been too busy researching and supporting many new things that have been added in the past year — for data lineage, for advanced governance (stewardship and workflow), and now “Open IGC”.  This is the ability to create nearly “any” type of new object within the Information Governance Catalog and then connect it to other objects with a whole new lineage paradigm.    If you are a user of Extensions (Extension Mapping Documents and Extended Data Sources), think of Open IGC as the “next evolution” for extending the Information Server repository.   If you are a user of DataStage, think of what it would be like to create your own nested objects and hierarchies, with their own icons, and their own level of “Expand” (like zoom) capability for drilling into further detail.

This new capability is available at Fix Central for 11.3 with Roll-up 16 (RU 16) and all of its pre-requisites (FP 2 among other things).

So exactly what is this “Open IGC”?

Open IGC (you may also hear or see “Open IGC for Lineage” or “Open IGC API”) provides us with the ability to entirely define our “own” object types. This means having them exist with their own names, their own icons, and their own set of dedicated properties. They can have their own containment relationships and define just about “anything” you want. They are available via the detailed “browse” option, and appear in the query tool. They can be assigned to Terms and vice versa, and participate in Collections and be included in Extension Mappings. …and then, once you have defined them, you can describe your own lineage among these objects, also via the same API, and define what you perceive as “Operational” vs “Design” based lineage (lineage without needing to use Extensions, and supporting “drill down” capabilities as we see with DataStage lineage).

Here are some use cases:

a) Represent a data integration/transformation process…or “home grown” ETL.    This is the classic use case.  Define what you call a “process” (like a DataStage Job)….and its component parts…the subparts like columns and transformations, and properties that are critical.   Outline the internal and external flows between such processes and their connections to other existing objects (tables, etc.) in the repository.

b)  Represent some objects that are “like” Extended Data Sources, but you want more definition…..such as (for example) all the parts of an MQ Series or other messaging system configuration…objects for the Servers, the Queue Managers, and individual Queues.  Give them their own icons, and their own “containment” depths and relationships.   Yes — you could use Extensions for this, but at some point it becomes desirable to have your own custom properties, your own object names for the user interface, and your own creative icons!

c) Overload the catalog and represent some logical “concept” that lends itself to IGC’s graphical layout features, but isn’t really in the direct domain of Information Integration. One site I know of wants to show something with “ownership”…but illustrate it graphically. They are interested in having “responsibility roles” illustrated as objects…whose “lineage” is really just relationships to the objects that they control. Quite a stretch, and would need some significant justification vs using tooling more appropriate for this use case, but very do-able via this API.

It’s all done based on XML and REST, and does not require that you re-install or otherwise re-configure the repository.  You design and register a “bundle” with your new assets and their properties, and then use other REST invocations to “POST” new instances of the objects you are representing.
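To give a feel for the mechanics, here is a rough Python sketch of the two REST interactions described above: registering a bundle, then POSTing instances of the new assets. The endpoint paths, file names, and credentials are assumptions written for illustration — consult the Open IGC documentation for your release for the exact REST resources and payload formats.

```python
# Rough sketch only: register an Open IGC bundle, then POST asset instances.
# Endpoint paths, credentials, and payload formats are illustrative
# assumptions -- verify them against the Open IGC / IGC REST documentation
# for your release.
import requests

BASE = "https://myserver:9443/ibm/iis/igc-rest/v1"   # hypothetical host/path
AUTH = ("isadmin", "password")                        # replace with real credentials

# 1. Register (or update) the bundle that defines the new object types.
with open("MessagingBundle.zip", "rb") as bundle:
    r = requests.post(f"{BASE}/bundles", auth=AUTH, verify=False,
                      files={"file": bundle})
    r.raise_for_status()

# 2. POST instances of the new objects (an XML "asset" document).
with open("queue_assets.xml", "rb") as assets:
    r = requests.post(f"{BASE}/bundles/assets", auth=AUTH, verify=False,
                      data=assets,
                      headers={"Content-Type": "application/xml"})
    r.raise_for_status()

print("Bundle registered and assets loaded")
```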

Quite cool…….and more to come…..I will be documenting my experiences with the API and the various use cases that I encounter.

What use cases do YOU have in mind?    :)

Next post in this series: Open IGC: a Simple Messaging Use Case

Ernie


Validating your REST based Service calls from DataStage


About a year ago the Hierarchical Stage (used to be called the “XML” Stage) added the capability of invoking REST based Web Services. REST based Web Services are increasing in popularity, and are a perfect fit for this Stage, because most REST based services use payloads in XML or JSON for their requests and responses.

REST based Web Services have a couple of challenges, however, because they do not use SOAP, and consequently, they rarely have a schema that defines their input and output structures. There is no “WSDL” like there is for a classic SOAP based service. On the other hand, they are far less complex to work with. The payloads are clean and obvious, and lack the baggage that comes with many SOAP based systems. We won’t debate that here…both kinds of Web Services are with us these days, and we need to know how to handle all of them from our DataStage/QualityStage environments.

Here are some high level suggestions and steps I have for working with REST and the Hierarchical Stage:

1. Be sure that you are comfortable with the Hierarchical Stage and its ability to parse or create JSON and XML documents. Don’t even think about using the REST step until you are comfortable parsing and reading the XML or JSON that you anticipate receiving from your selected service.

2. Start with a REST service “GET” call that you are able to run directly in your browser. Start with one that has NO security attached. Run it in your browser and save the output payload that is returned.

3. Put that output in a .json or .xml file, and write a Job that reads it (using the appropriate XML and/or JSON parser Steps in the Assembly). Make sure the Job works perfectly and obtains all the properties, elements, attributes, etc. that you are expecting. If the returned response has multiple instances within it, be sure you are getting the proper number of rows. Set that Job aside.

4. Write another Job that uses the REST Step and just tries to return the payload, intact, and save it to a file. I have included a .dsx for performing this validation. Make sure that Job runs successfully producing the output that you expect, and that matches the output from using the call in your browser.

5. NOW you can work on putting them together. You can learn how to pass the payload from one step to another, and include your json or xml parsing steps in the same Assembly as the REST call, or you could just pass the response downstream to be picked up by another instance of the Hierarchical Stage. Doing it in the same Assembly might be more performant, but you may have other reasons that you want to pass this payload further along in the Job before parsing.

One key technique when using REST with DataStage is the ability to “build” the URL that you will be using for your invocations. You probably aren’t going to be considering DataStage/QualityStage for your REST processes if you only need to make one single call. You probably want to repeat the call, using different input parameters each time, or a different input payload. One nice thing about REST is that you can pass arguments within the URL, if the REST API you are targeting was written that way by its designers.
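Before wiring this into the Hierarchical Stage, it can also help to prove out the call and the URL-building logic outside of DataStage (essentially steps 2 through 4 above, without a browser). Here is a small Python sketch along those lines — the URL, parameter names, and credentials are placeholders, not a real service.

```python
# Small sketch: prove out a REST GET and save its payload before building
# the DataStage Job.  The URL, parameter names, and credentials below are
# placeholders for whatever service you are actually calling.
import requests

base_url = "https://myserver:9443/example/api/v1/assets"   # placeholder URL
asset_id = "12345-abcde"                                    # placeholder identifier

# Build the URL the same way your upstream Derivation would, then call it.
response = requests.get(base_url, params={"id": asset_id},
                        auth=("user", "password"), verify=False)
response.raise_for_status()

# Save the payload so the parsing Job from step 3 can read it from disk.
with open("payload.json", "w", encoding="utf-8") as out:
    out.write(response.text)

print("Saved", len(response.text), "bytes")
```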

In the Job that I have provided, you will see that the URL is set within the upstream Derivation. It is very primitive here — just hard coded. It won’t work in your environment, as this is a very specific call to the Information Governance Catalog API, with an asset identifier unique to one of my systems. But it illustrates how you might build YOUR url for the working REST call that you are familiar with from testing inside of your browser or other testing tool. Notice in the assembly that I create my own “argument” within the REST step which is then “mapped” at the Mappings section to one of my input columns (the one with the Derivation). The Job is otherwise very primitive — without Job Parameters and such, but simply an example to help you get started with REST.

Ernie

…another good reference is this developerWorks article by one of my colleagues:

https://www.ibm.com/developerworks/data/library/techarticle/dm-1407governrest/

BasicRESTvalidation.dsx


Open IGC. A Simple “Messaging System” Use Case


In the previous post in this series about Open IGC (https://dsrealtime.wordpress.com/2015/07/29/open-igc-is-here/), I described several use cases to get you thinking about how you might apply this technology to your own solutions. I have since encountered several other great use cases that I will discuss in future posts — but for now, let’s dive into one of them that has already been discussed: Messaging Systems.

A Messaging System or environment is a unique case of Source and/or Target. It’s not quite a “file”, although it can “contain a file”…nor is it the same as a Table. Queues have “data” but they also can store other things, and have lots of other qualifiers, such as persistence, message types, and read methods. There is an implied hierarchy in a messaging system, but it isn’t the same as a subdirectory with files or a schema with its collection of tables.

Governance covers many things, and queues and queuing systems certainly qualify as objects worth governing, depending on your specific needs. Queues and their accompanying objects may require Stewardship, Application and Term definitions, and can carry operational information, such as Current Queue Depths, or historical statuses. Queues certainly can and should participate in lineage and impact analysis reporting, as they are often the “beginning” or the “termination” of a lineage “flow”.

All of these unique qualities justify the application of “Open IGC” to my Messaging System. I also should consider “volume” and “available skill sets”, but for now let’s assume that I have a significant number of messaging artifacts to justify the work effort, and the skills in xml and REST to get it done.

What will it look like for my users? What can I do with it once it is defined with Open IGC?

Let’s see what the finished result looks like in the Information Governance Catalog (IGC). Once we register a new set of Object types (we call this “registering a new bundle”), the objects appear within each regular and expected context of IGC. I can browse the new Objects:

[Screenshot: browsing the new messaging object types in IGC]

I can assign them to a Business Term or other relationships:

[Screenshot: assigning the new objects to a Business Term]

…use them in a Query:

[Screenshot: using the new objects in a Query]

…and have them participate in Data Lineage Reporting:

[Screenshot: the new objects participating in Data Lineage Reporting]

The bottom line is that I can use them as I would most any other object that is part of Information Server, including Stewardship and integration with Rules and Policies. The fact that I am able to give these objects their own structure, their own properties, and their own icons, makes their use for governance more inviting to the user community and more understandable by everyone. This helps encourage adoption and participation in the governance framework.

Once the new bundle of object types is registered, I can populate the repository with actual instances. The brief lineage picture above gives you an idea of how objects of this messaging bundle participate in lineage analysis, but we can also review their details. Here is the detail page for one of the Queue Managers, showing just a few of the properties that have been modeled with this bundle, and populated for our environment:

[Screenshot: detail page for a Queue Manager, showing its modeled properties]

The Open IGC also provides a paradigm for including “Operational Metadata” in a one:many relationship that makes it convenient to include run time statistics or other details of your processes that may be important for your governance scenarios. Here you see how queue statistics might be captured and stored for later review:

[Screenshot: queue run statistics captured as operational metadata]

This is a simple implementation. I am not representing a complex process, with inner subtasks [we’ll get there in a later post], yet have created a new set of objects that more clearly illustrate an important concept for the enterprise. Governance adoption can be simpler, and will bring aboard a new audience whose needs have been met with custom objects, icons, and relationships. Data lineage is supported with known tooling, using Extension Mappings that are already in use by other parts of the governance team.

Next post we’ll take a look at what is required to define new bundles like this and to load up new instances of metadata into the Information Server repository!

Next post in this series:

Open IGC: Defining a new bundle

–ernie

