(somewhat) Social
« Loading MS Connect To A DB–Part 3: Keeping The Data Current | Main | T-SQL Tuesday #14–Resolutions »
Tuesday
Jan112011

Loading MS Connect To A DB–Part 2: Parsing & Loading RSS Data

In Part 1 of the series we set up our database tables ready to load the data, next up is pulling the data from the RSS feed. I’m using PowerShellISE for all this work, it’s a simple and quick tool for building scripts.

Subscribing To The RSS Using PSRSS

The PSRSS module makes it extremely easy to add a new RSS subscription, we just need to call New-Feed with a name and the URL to the XML file.

The URL for the new SQL Server Connect items feed is https://connect.microsoft.com/rss/68/RecentFeedbackForConnection.xml. I decided to give this feed a name of "Connect: SQL Server New" giving us the Powershell command

 
New-Feed -Name "Connect: SQL Server New" -Url https://connect.microsoft.com/rss/68/RecentFeedbackForConnection.xml
 

We can confirm that the subscription was successfully created using Get-Feed.

get-feed | select Name | where-object {$_.Name -match "Connect: SQL Server New"}
Reading The RSS

Pulling articles from the RSS subscription is as easy as calling Get-Article along with the feed name.

In order to start working with the data I wanted to easily identify the member parts of the articles. To do this I decided to just pull the first article and look through it’s contents

Get-Article "Connect: SQL Server New" | select -First 1

Looking through the results we see a couple of key items that we are going to want to load into the database:

  • Title (Title column in ConnectItems)
  • Link (URL column)
  • Guid (ID column)
  • Description (ItemDescription column)
  • Author (Author column)
  • Modified (Modified column)

The PubDate would be very useful to have as an article create date however this appears to just be a default value that doesn’t bear any actual relation to the date the Connect Item was added.

It’s great that we have direct mappings for much of the data we want to load, our challenge comes with trying to get the Status, Up-Votes and Down-Votes from the Description.

 

Parsing The Description Data

Let’s take a closer look at the Description data

Report Builder reports errors with controls i.e. textbox22. But there is no way to find that control.
                  
                   In BIDS there is a drop down to select a control by. There needs to be something similar in RB.<br><br>Status: Active, 4 Up-Votes, 0 Down-Votes, 0 workarounds, 0 comments, feedback id: 634019

Having looked through a few other articles there was always the <br><br> prior to the information that I wanted to pull. This allowed me to use the substring function of Powershell along with LastIndexOf of to pull just the data highlighted above.

The Substring function works the same way as in SQL, pass in the string along with the starting position you want to read from and the number of characters you want to read.

LastIndexOf will return the position of the final entry of a given string in the data you pass it.

If we use this item as an example the string <br><br> is last seen at position 198

$FeedItem = Get-Article "Connect: SQL Server New" | select -First 1 
$FeedItem.Description.LastIndexOf("<br><br>")

This value can be used within the Substring function as a starting position. A simple math function adding 8 will eliminate <br><br> being in the results, and by not passing a second numerical value to Substring we actually get all the data from the position 206 to the end of the Description.

$FeedItem = Get-Article "Connect: SQL Server New" | select -First 1 
$FeedItem.Description.Substring($FeedItem.Description.LastIndexOf("<br><br>")+8)

This has given us a new string to work with that we can parse down to get our information.

Status: Active, 4 Up-Votes, 0 Down-Votes, 0 workarounds, 0 comments, feedback id: 634019

 

Grabbing The Vote Counts

So how to get the counts out of the data? Enter the awesomeness of Regular Expressions.

Using our LastIndexOf and Substring functions again we can now pass this into the regex class and get do some pattern matching to pull data.

In particular we’re only looking for numbers in the string. Using the expression \d{1,5} allows us to look for numerical value between 1 and 5 characters long (download a regular expressions cheat sheet to help understand this).

$FeedItem = Get-Article "Connect: SQL Server New" | select -First 1 
$Substr = $FeedItem.Description.Substring($FeedItem.Description.LastIndexOf("<br><br>")+8)
$UpCount = [regex]::Matches($Substr.Substring($Substr.LastIndexOf("Up-Vote")-5, 5), "\d{1,5}") | Select Value
[int]$UpCount.Value

Running this gives us a result of 4. Confirming this against the full string we know we’re getting the right data.

We repeat this changing Up-Vote for Down-Vote to get the Down-Votes from the description.

Another use of substring provides us with the status.

All of these values are loaded into variables for use later.

 

Loading The Data To The Database

Our Status information as well as Up-Vote and Down-Vote counts are sitting in variables. We also load the full Description, Link, Title, Guid, Author and Modified data into variables as well. We use these to build a SQL insert statement:

$FeedItem = Get-Article "Connect: SQL Server New" | select -First 1 
$Substr = $FeedItem.Description.Substring($FeedItem.Description.LastIndexOf("<br><br>")+8) -replace "'", "" -replace "`"", ""
$UpCount = [regex]::Matches($Substr.Substring($Substr.LastIndexOf("Up-Vote")-5, 5), "\d{1,5}") | Select Value
$ItemUpVote = [int]$UpCount.Value
$DownCount = [regex]::Matches($Substr.Substring($Substr.LastIndexOf("Down-Vote")-5, 5), "\d{1,5}") | Select Value
$ItemDownVote = [int]$DownCount.Value
$val = $Substr.Substring(0, $Substr.LastIndexOf("Up-Vote"))
$ItemStatus = $val.Substring(0, $val.LastIndexOf(",")) -replace "Status: ",""
    $ItemDescription = $FeedItem.Description.Substring(0, $FeedItem.Description.LastIndexOf("<br><br>")) -replace "'", "" -replace "`"", ""
    $ItemURL = $FeedItem.Link
    $ItemTitle = $FeedItem.Title -replace "'", "" -replace "`"", ""
    [int]$ItemID = $FeedItem.Guid
    $ItemAuthor = $FeedItem.Author
    $ItemModified = $FeedItem.Modified
 
    $SQLInsert = @"
        INSERT INTO [dbo].[ConnectItems_Staging]
           ([ID],[Title],[URL],[ItemStatus],[UpVoteCount],[DownVoteCount],[ItemDescription],[Author],[Modified])
        VALUES
           ($ItemID,'$ItemTitle','$ItemURL','$ItemStatus',$ItemUpVote,$ItemDownVote,'$ItemDescription','$ItemAuthor','$ItemModified')
"@
$SQLInsert

The produced insert statement is

INSERT INTO [dbo].[ConnectItems_Staging]
           ([ID],[Title],[URL],[ItemStatus],[UpVoteCount],[DownVoteCount],[ItemDescription],[Author],[Modified])
        VALUES
           (634019,'Enable controls to be found by name in Report Builder','http://connect.microsoft.com/SQLServer/feedback/details/634019/enable-controls-to-be-found-by-name-in-report-builder', 'Active', 4,0,'Report Builder reports errors with controls i.e. textbox22. But there is no way to find that control.

In BIDS there is a drop down to select a control by. There needs to be something similar in RB.','Simon Sabin','12/30/2010 14:50:13')

Key within the script is the multi-lined statement

$SQLInsert = @"
        INSERT INTO [dbo].[ConnectItems_Staging]
           ([ID],[Title],[URL],[ItemStatus],[UpVoteCount],[DownVoteCount],[ItemDescription],[Author],[Modified])
        VALUES
           ($ItemID,'$ItemTitle','$ItemURL','$ItemStatus',$ItemUpVote,$ItemDownVote,'$ItemDescription','$ItemAuthor','$ItemModified')
"@

This doesn’t need to be multi-lined however that does make it a lot easier to read. Creating this is as simple as encapsulating the data between @” and “@.

Note (I’m not sure if this is a bug or deliberate) is that the “@ have to be the first two characters on the line ending the statement.

Hopefully you’re loading the SQL cmdlets with your Powershell profile, if not you’ll need to load those manually in order to work with the database

Add-PSSnapin SqlServerCmdletSnapin100 | out-null
Add-PSSnapin SqlServerProviderSnapin100 | out-null

With these loaded inserting the data is as simple as running invoke-sql.

invoke-sqlcmd -ServerInstance "localhost" -Database "DenaliConnect" -Query "$SQLInsert"

 

The data is now loaded into the ConnectItems_Staging table. We then called the MERGE proc in order to populate the main table (this helps prevent attempts to load duplicate data).

invoke-sqlcmd -ServerInstance "localhost" -Database "DenaliConnect" -Query "exec dbo.[MergeItems]"

 

A quick query in SSMS will validates the data is loaded correctly.

 

Final Steps

Not content with just pulling in the new items RSS feed I also decided to pull the recently modified one as well. This helped keep data consistent and add items that might have been missed.

To simplify management of this I turned all the logic for parsing the data into a function called Get-DenaliConnect and just called the function for each RSS feed.

Get-DenaliConnect "Connect: SQL Server New" https://connect.microsoft.com/rss/68/RecentFeedbackForConnection.xml
Get-DenaliConnect "Connect: SQL Server Modified" https://connect.microsoft.com/rss/68/RecentlyModifiedFeedbackForConnection.xml

 

You can download the full script here.

Next up we’ll cover keeping the data current.

 

PrintView Printer Friendly Version

EmailEmail Article to Friend

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>