In Germany this April?

clock February 5, 2009 05:26 by author Jamie

If so – check out the European PASS Conference, if not – maybe you should be!

This conference will have a great BI focus including a pre-conference session on using SQL Server BI tools to monitor SQL Server BI tools and at least two data mining sessions, including one from Steve Simon of State Street Corporation talking about how they personally leverage SQL Server Data Mining to manage their risk exposure.  Not another talking head from the product team (aka “me”) spouting the general awesomeness of the product, but user solving real-world user problems with the (imHo) best data mining product in the world.

If you can, check it out it will be a great chance to network with other users and see real-world implementations rather than MSFT demos!

Be the first to rate this post

  • Currently 0/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


Querying the Dependency Net

clock December 12, 2008 08:42 by author Jamie

OK, ok, so just yesterday I posted that it was easy to determine what queries were being used by the browsers to get the data underlying the view.  Of course it’s easy to get them, but without a teensy weensy bit of documentation, it’s not necessarily easy to understand what the parameters mean or possibly what the results mean.

For instance, take the dependency net – everybody loves the dependency net – it’s cool and shows that “high-level information” that everybody craves.  For example, this picture showing how “Die Hard” is the nexus linking Beverly Hills Cop with the Lethal Weapon family of movies.

image

Whatever.

Anyway, to get this information we call the deceptively simple ARGetNodeGraph function, that is, for Association Rules models, like this:

CALL System.ARGetNodeGraph('Associate Movies', 60)

With the first parameter being the name of the model and the second being the number of nodes to return.  The function chooses nodes using a heuristic that considers the popularity of the node and balances between inputs and outputs in order to produce a nice result.

That’s easy enough – let’s take a look at the output (truncated for space)

image

Whoa!  That’s a bit more complex.

Basically the output is divided into two sections indicated by the "NODE_TYPE” column.  The “NODE_TYPE” column actually has nothing to do  with a node type and (if I had to guess, which I don’t) I would say that NODE_TYPE was used to reuse names from the MINING_CONTENT schema rowset rather than be the most accurate moniker for the column itself.  NODE_TYPE is actually the ROW type, and has the values 1 or 2.  If the NODE_TYPE is 1, then the row represents a NODE in the graph.  If the NODE_TYPE is 2, the row represents an EDGE in the graph.  All of the other column interpretations depend on the type of row.

For NODE rows (NODE_TYPE=1, for those readers with serious short-term memory issues), the columns describe a node like this:

  • NODE_UNIQUE_NAME_1 – This is an ID used to identify the node in the edges section.
  • NODE_UNIQUE_NAME_2 – The default label for the node
  • MSOLAP_NODE_SCORE – Unused
  • ATTRIBUTE_NAME – The attribute name part of the default label, used to compose and decompose different options for node displays
  • ATTRIBUTE_LONG_NAME – The full attribute name – i.e. if the attribute came from a nested table, it has the name of the nested table, plus the specific attribute name in parenthesis.  Again used for label composition.
  • ATTRIBUTE_VALUE – The value name part of the default label, again used for label composition.

For EDGE rows (NODE_TYPE=2) the columns describe a directed edge like this:

  • NODE_UNIQUE_NAME_1 – The ID (from the NODE rows) of the source of the edge
  • NODE_UNIQUE_NAME_2 – The ID (from the NODE rows) of the destination of the edge
  • MSOLAP_NODE_SCORE – The strength of the edge – I don’t believe there is any specific range
  • Other columns - unused

So, that’s how the dependency net gets created – initially.  There are actually many additional functions used by the dependency net to, as you may say, fill out the graph.

For example, if you click the “Find Node” button in the dependency net browser, the browser issues this call:

CALL System.ARGetNodes('Associate Movies')

This call returns a result set like the NODE section of the ARGetNodeGraph, except without the NODE_TYPE column, with a row for every possible node – not just the top 60.  The only parameter is the name of the model.

If you select a node that is not already in the graph, this is where it gets a bit interesting.  The browser issues a call like this:

CALL System.ARAddNodes('Associate Movies', '600', 
     '604;726;648;733;630;700;719;718;130;387;386;712;621;727;670;154;337')
 

ARAddNodes has the following parameters:

  • strModel – the model name, but you already new this
  • strNodesToAdd – a semi-colon delimited list of node ids to add to the graph
  • strNodesInGraph – a semi-colon delimited list of node ids already in the graph

The result set looks like the EDGE section of the result of ARGetNodeGraph without the unused columns and contains only the edges between the nodes identified in strNodesToAdd and those identified in strNodesInGraph.  Note that the node id’s that are used are only those returned from ARGetNodeGraph or ARGetNodes and are not node id’s from the model content schema rowset.

NB When you see the function calls in SQLProfiler, you will get the fully qualified function name, e.g. System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.ARAddNodes.  You can eliminate all the intermediate namespaces and just call System.<function name>.

NB2 There are a set of equivalent stored procedures for Decision Trees, that you can probe by browsing a tree model’s dependency network

NB3 Nope, you won’t find a Naive Bayes version by browsing a NB model’s dep net – that browser was never “updated” to use stored procedures to get dependency network information.  You can use the Visio Data Mining Template and see what functions are called….but their different…..

Be the first to rate this post

  • Currently 0/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


Data Mining in SQL Server 2008 Book Review

clock December 11, 2008 05:12 by author Jamie

I just read a great review of our book from Richard Lees in Australia.  Richard is one of the early adopters of Analysis Service and Data Mining, so he has a lot of experience in this area (we actually reference some of his samples in the book!).  You should check out his blog anyway, just to learn from a master and expand your BI expertise!

Currently rated 2.0 by 2 people

  • Currently 2/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


Querying like the Data Mining viewers do

clock December 10, 2008 10:15 by author Jamie

It happens all the time.  You see some cool user interface trick and think “how can I do that?”  Or, maybe more likely, you think “gee, that’s useful, where’s that documented?”  In any case, if you have ever wondered about how the DM viewers get the data to display on the screen, then this is the post for you.

In many cases, what is displayed in the data mining viewers is the result of built-in stored procedures which allow the processing required for the view to be done on the server without requiring all of the model content to be brought to the server.  In this post, I’ll give one example of those stored procedures, and then I’ll add some information on how you con figure out the rest.  Not everything in the viewers is calculated on the server, but many of them are, and getting access directly to the data is useful (I presume, since everyone always asks…. :) )

Let’s start with an easy one – the Naive Bayes attribute discrimination view.  This view shows how the differences in the input attributes across the states of an output attribute.  It generally looks like this:

image

The viewer doesn’t download all of the correlations in the Naive Bayes content, rather it calls the stored procedure GetAttributeDiscrimination like this:

 

CALL System.GetAttributeDiscrimination
  ('Classify CollegePlans NB', 
   '100000005', 
   'Plans to attend', 
   1, 
   'All other states', 
   2, 
   0.0005, 
   true)

The not-so-obvious parameters are, in order strModel, strPredictableNode, strValue1, iValType1, strValue2, iValType2, dThreshold, and bNormalize.  Let’s go through these parameters…

strModel – the name of the model, of course!

strPredictableNode – this one is a bit difficult, as it takes the Node Unique Name of the target attribute instead of just the string you see in the viewer.  The Node Unique Name identifies the attribute in the content rowset generated by the model.  You can get the list of predictable attributes and their Node Unique Names by calling another stored procedure – like this CALL System.GetPredictableAttributes('ModelName').  This stored procedure returns two columns – one for the attribute name and one for the Node Unique Name.

strValue1 – The name of the value you want to compare on the left hand side.  The usage of this parameter depends on the value of the next parameter, which is….

iValType1 – This parameter indicates how to treat strValue1.  It can have values 0,1, or 2.  If this parameter is a 1, the value in strValue1 is the actual state of the attribute.  However, if this parameter is a 0 or 2, the value in strValue1 is ignored.  If the value is 0, the left-hand value is considered to be the “missing state”.  If the value is 2, the left hand value is considered to be “all other states.”  In the example above, “All other states” is specified only because it looks nice (and it’s easier to just drop the combo box value into the function call even if it will just be ignored….)

strValue2 – Like strValue1, but for the right hand side.

iValType2 – Like iValType2, but for the right hand side.

dThreshold – A threshold value used to filter results, such that small correlations don’t come back in the results.    Usually you set it to a really small number like 0.0005 in the example above.

bNormalize  - Whether or not the result is normalized.  If this value is true, the results are normalized to a maximum absolute value of 100, giving a possible range of –100 to 100.  All this does is take the largest absolute value in the result and divide that into 100, and then multiple all the other numbers by that amount.  If set to false, the numbers are whatever they are and you can figure it out yourself – it’s up to you, but we always set this to true.

The results

Calling this routine returns a row for every differentiating attribute/value pair with a score higher than the specified threshold.  The row contains the differentiating pair along with the score and some other columns and looks somewhat like this:

image

The score column is the “important” one and is best explained as if you did something like a c language compare routine e.g int Compare(int v1,int v2) { return v1-v2; } .  That is, if the value is positive it favors value1 and if the value is negative it favors value2.  I’m not going to go in depth on the other columns other than to say that they are the actual counts of the correlations of the discriminator against the inputs.  The best way to understand them is to look at the Mining Legend as you browse the model and click on rows.  For example if you clicked on the first row of the result above (in either picture), the Mining Legend would look like this:

image

Of course, once you have the result set you can use it wherever you want – in Reporting Services, Integration Services, or in you custom program.

How to get the function calls

So, how do you find this laundry list of undocumented stored procedures.  Well, some are documented in my book, but you can get them all for free just by looking in the right place.  You can run the SQL Server Profiler to see what functions are being called by the viewers.  Here’s how you do it.

First, run SQL Server Profiler:

image

Then, start a New Trace from the File Menu and connect to Analysis Services

image

You can leave all the defaults on for the Trace Properties dialog that appears

image

Then go to any data mining viewer and browse a model!  That’s it!  You will get trace output that looks like this:

image

You can get the query text in the bottom pane of the trace window by finding and selecting rows with event types of ‘Query Begin’ or ‘Query End’.

Oh, and make sure you stop the trace before you shut down SQL Profiler - ‘cuz it will keep going.

Let me know if you find any particularly tricky stored procedures that you need help with – best place to ask is in the MSDN forums!

Be the first to rate this post

  • Currently 0/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


Those kids won't eat anything!

clock November 20, 2008 08:16 by author jamie

I did my BI Power Hour demo at PASS 2008 yesterday and it featured my twin boys Bowen and Logan.

 IMG_0450sm

Logan (right) has an ASD (Autism Spectrum Disorder) that limits his diet (no milk products, gluten, or soy) and Bowen has some sensory issues, but that doesn't explain their extreme pickiness with food.  For example, I made a rice-yogurt-blueberry smoothie for Logan and he just looked at it and said "yucky!".  Ugh.

So, I decided to make a worksheet listing foods they these kids will eat and won't eat.  I used attributes of Color, Type, and Processed, along with a column indicating whether or not they will actually eat the food.  Of course, I had to answer to myself disturbing questions such as "what color are hot dogs?", but I got through it.

image

Then I used the Prediction Calculator from the new Table Analysis Tools Excel addin for SQL Server 2008.  The Prediction Calculator creates a little widget in Excel that allows you to enter in input values and based on your costs.  Running the Prediction Calculator is as simple as selecting your table, clicking the Prediction Calculator button on the Table Analyze ribbon, and then choosing the column and value you want to predict.

imageimage

There's actually a little more work to do after you run the tool, and that is specifying your costs and profits.  Your costs are the cost you incur for getting the answer wrong, and a profit is the profit you make when you get the answer right.  The grid below is in Prediction Calculator Report that is created after running the tool.  In this case, I figured that if I guessed that my kids would eat some food and was wrong, it would cost me the 5 bucks for the food which would be wasted, therefore I set the False Positive Cost to 5.  Furthermore, I figured that if I guessed correctly that they wouldn't eat a food, I saved the money and the 5 bucks would still be in my pocket, so I set the True Negative Profit to 5 as well.

image

Doing so, gave me a profit chart that looks like below - which is a problem.  Basically, what a "always rising" chart says is that you should always say "no" to achieve the highest profit - which makes sense since I can only lose money by saying "yes" and only gain money by saying "no".  Essentially my laptop analysis tells me that my kids are simply too picky and I should just make them starve!  Hah!  My laptop apparently has never had kids!

image

Anyway, I decided that there's some nominal value for my kids eating, so I changed the parameters a bit.  I figured that my kids complaining that I didn't get a food that they wanted causes me the psychological cost of 1 dollar (or maybe the real cost of going back to the store of a dollar, however you want to see it), and I set the False Negative Cost to 1.  Also I decided the value of my kids not getting a sugar imbalance and (literally) bouncing off the walls is a "peace of mind" profit of a dollar, so I set the True Positive Profit to 1 as well.  This gives me a better behaved profit chart with a peak like below.

image

image

Once I've set my costs, I can go to the Prediction Calculator sheet that was created and select my inputs and see if my kids will actually eat the food.  In this case, "Yellow, unprocessed, grains and nuts" doesn't exceed the threshold of 642, so the answer is no.  Yay!  I saved 5 bucks because my kids won't eat corn :(.

image

Of course, this isn't very useful for me in this format - it's not like I'm going to lug my laptop around the grocery store plugging in values for every product I see on the shelf.  So to get around this I use the new, experimental Cloud Data Mining ServiceThe web interface contains many of the Table Analysis Tools, including the Prediction Calculator, and you can access data from CSV files, SQL Data Services, or another way which is not entirely obvious and not documented by simply pasting your data from Excel to the web.

image

Once you've pasted your data, I run the Prediction Calculator just like I did in Excel.  There's one small difference, however, in the result is that I have the HTML fragment for the calculator itself.  Therefore I can make my own web site with the calculator embedded inside.

image

Once I have my website (which in this case is at 

http://www.sqlserverdatamining.com/PASS2008BIPowerHourDemo.htm) I can access the Prediction Calculator from any web-enabled device - like my phone, which I can carry around the grocery store and determine that my kids will eat .... brown.....processed.....meat.... oh yay....

Untitled 3

Currently rated 4.8 by 4 people

  • Currently 4.75/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


Time's running out for your free data mining book!

clock November 20, 2008 06:50 by author jamie

 The survey is now closed. Thank you 

 Tomorrow is the last day you can fill out this survey for a chance to win one of ten copies of Data Mining with SQL Server 2008.  I used the the Data Exploration tool in the SQL Server 2008 Data Mining Client for Excel, and saw that it takes most people less than 15 minutes to fill it out.

image

(time to take survey in seconds)

Regarding the book, just today I received my sample copies and I was surprised at how much bigger it is than the 2005 book!  It rounds out at 636 pages - I remember last version we were running up against publisher defined page limits and we cut back material to make it fit.  This time, we just wrote what needed to be written and the publisher agreed to let us, the authors, make the decisions on how long the book should be.  I'm really happy about how the text turned out this time - we still don't have a review on Amazon, so hopefully the public will agree!

Be the first to rate this post

  • Currently 0/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


See me at PASS in Seattle next week

clock November 13, 2008 06:21 by author jamie

I'll be at the PASS 2008 conference in Seattle most of next week.  Currently (subject to change) I'll be presenting at the BI Power Hour on Wednesday at 1:45 and will be in the Ask the Experts area on Wednesday after the Power Hour session and Thursday from 11 to 2.

Stop by if you have any questions or just want to chit-chat.  I'll be happy to sign any book you bring by!  (Even if I'm not the author!)

Be the first to rate this post

  • Currently 0/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


RecentPosts

Sign in