
Posted on Wednesday, December 30, 2015 2:00 PM

Tom Kerkhove by Tom Kerkhove

In my previous blog post I introduced a blog series in which we will analyse StackExchange data using Microsoft Azure Data Lake Store & Analytics.

Today I'll illustrate where we can download the StackExchange sample data & how we can upload and store it in the Data Lake Store by using PowerShell.

There are several options for data storage in Azure, each with a specific goal. For data analytics - especially with Azure Data Lake Analytics - Azure Data Lake Store is the de facto choice.

The StackExchange data is made available on Archive.org as zip-files. We will use an Azure VM to download it from the website, unzip every folder and upload it to the Store. Let us start!

Why do we use Azure Data Lake Store over Azure Blob Storage?

Before we start, you might ask why we are using Azure Data Lake Store instead of Azure Blob Storage.

The reason is very simple - We are planning to store a decent amount of data and perform analytics on it with Azure Data Lake Analytics.

While Azure Blob Storage can be used with Azure Data Lake Analytics, it is recommended to use Azure Data Lake Store instead. The service is built for running analytical workloads on top of it and is designed to scale along with its load.

Azure Data Lake Store also offers unlimited storage, without limits at the file or account level, which isn't the case for Azure Storage.

However - Storing all your data in Azure Blob Storage will be a lot cheaper than storing it in Azure Data Lake, even when you are using Read-Access Geographically Redundant Storage (RA-GRS).

These are only some of the differences. The services also differ in areas such as access control, encryption, etc.

To summarize - There is no silver bullet. It basically depends on your scenario and how much data you want to store. My suggestion is that if you'll do big data processing in Azure, you should use Azure Data Lake Store!

If for some reason you decide that the store you've picked doesn't fit your needs, you can still move it with tools like Azure Data Factory or PowerShell.

Note - During the public preview Azure Data Lake Store is cheaper, but keep in mind that this is only 50% of the GA pricing.

Preparing our environment

For this phase we'll need to provision two resources: A new Azure Data Lake Store account & an Azure VM in the same region.

But do I really need an Azure VM?

Benefits of using an Azure VM

It is also possible to do everything locally, but I personally recommend using a VM because we can let it run overnight more easily and it will be faster.

It allows us to download a file of 28 GB in the Azure datacenter, unzip 250+ folders overnight and upload 150 GB to the Store. This means that we will only pay for 28 GB of ingress instead of 150 GB; however, keep in mind that you also pay for the VM itself.

You will only benefit from this if the resources are allocated within the same region, otherwise Azure will charge you for 150 GB of egress & ingress.

Provisioning a new Data Lake Store

To provision a Data Lake Store resource, browse to the Azure portal and click on 'New > Data + Storage > Data Lake Store (Preview)'.

Give it a self-describing name, assign a resource group and location and click 'Create'.

After a couple of minutes, the Store will be created and you should see something similar to this.

As you can see it includes monitoring on the total storage utilization, has a special ADL URI to point to your account and has a Data Explorer. The latter allows you to navigate and browse through your data that is stored in your account.

At the end of this article you should be able to navigate through all the contents of the data dump.

Provisioning & configuring a VM

Last but not least, we'll provision a new Azure VM in which we will download, unzip & upload all the data.

In the Azure Portal, click 'New > Compute' and select a Windows Server template of your choice. Here I'm using the 'Windows Server 2012 R2 Datacenter' template.

Assign a decent host name, user name & solid password and click 'Create'.

We will also add an additional data disk to the VM on which we will store the unzipped data as the default disk is too small.

To do so, navigate to the VM we've just provisioned and open the 'Settings' blade.

Select 'Disks', click on 'Attach New' and give it a decent name. 

We don't need to increase the default value as 1024 GB is more than enough.

Once the disk is added it will show up in the overview. Here you can see my stackexchange-data.vhd data disk.

Now that the disk is added we can connect to the machine and prepare it by formatting the disk and giving it a decent name.
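
If you prefer to script this step instead of doing it by hand, a minimal sketch with the built-in Storage cmdlets could look like this (the disk number and label are assumptions - verify them against the Get-Disk output first):

# List the disks so we can identify the freshly attached, uninitialized data disk
Get-Disk

# Initialize the raw disk, create one partition spanning the whole disk and format it as NTFS
# (disk number 2 and the label are assumptions - verify against the Get-Disk output)
Initialize-Disk -Number 2 -PartitionStyle GPT
New-Partition -DiskNumber 2 -UseMaximumSize -AssignDriveLetter |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "stackexchange-data" -Confirm:$false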

Now that we have a Data Lake Store account and a VM, we are ready to handle the data set.

Retrieving the StackExchange data

StackExchange has made some of their data available on Archive.org, allowing you to download insights about all their websites.

The website provides several options for downloading everything, ranging from a torrent to individual zips to one large zip.

I personally downloaded everything in one zip and two additional files - Sites.xml & SitesList.xml.

As we can see I've stored all the information on the new data disk that we have added to the VM.

Extracting the data

Time to unzip the large files into individual zip files per website. To do so, you can use a tool such as 7-Zip.

Once it's done, it should look similar to this.

Next up - Unzipping all the individual websites. It is recommended to select all the zip-files and unzip them at once.

Grab a couple of coffees because it will take a while.
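
If you'd rather script the extraction than do it through the 7-Zip UI, a rough PowerShell sketch could look like this (the paths, the archive extension and the 7-Zip install location are assumptions - adjust them to your VM):

# Extract every per-website archive on the data disk into a folder with the same name
# (F:\ paths, the *.7z filter and the 7-Zip location are assumptions - adjust to your setup)
$sevenZip = "C:\Program Files\7-Zip\7z.exe"
Get-ChildItem "F:\2015-August-Stackexchange\*.7z" | ForEach-Object {
    $destination = Join-Path "F:\2015-August-Stackexchange" $_.BaseName
    & $sevenZip x $_.FullName "-o$destination" -y
}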

You should end up with around 150 GB of data, excluding the zip-files.

So what kind of data do we have?!

Looking at the data

Now that we have unwrapped all the data we can have a look at what data is included in the data dump.

As mentioned before, the zip contains a folder for each website by StackExchange, including all the meta-websites.
Each folder gives you all the relevant data for that specific website, going from users & posts to comments and votes and beyond.

Here is all the data that is included for coffee-stackexchange-com in this example:

+ coffee-stackexchange-com
    - Badges.xml
    - Comments.xml
    - PostHistory.xml
    - PostLinks.xml
    - Posts.xml
    - Tags.xml
    - Users.xml
    - Votes.xml

However, there is one exception - Since StackOverflow is so popular, there is a lot more data and thus bigger files. Because of this, each of its files has been split out into a dedicated folder.

Here is an overview of how the data is structured:

+ stackapps-com
    - Badges.xml
    - ...
    - Votes.xml
+ stackoverflow-com-badges
    - Badges.xml
+ stackoverflow-com-...
+ stackoverflow-com-votes
    - Votes.xml
+ startups-stackexchange-com
    - Badges.xml
    - ...
    - Votes.xml

With that structure in mind, let's have a look at how we can upload the data to Azure.

Uploading to Azure with PowerShell

In order to upload all the data, it would be a good idea to automate the process. Luckily, Azure provides a lot of PowerShell cmdlets that allow you to do just that.

For our scenario I've created a script called ImportStackExchangeToAzureDataLakeStore.ps1 that will loop over all the extracted folders and upload their files to a new directory in Azure Data Lake Store.

Although it's a simple script, I'll walk you through some of the interesting commands it uses.

In order to interact with Azure Data Lake Store from within PowerShell we need to use the Azure Resource Manager (Rm) cmdlets.

To do so we first need to authenticate, assign the subscription we want to use and register the Data Lake Store provider.

# Log in to your Azure account
Login-AzureRmAccount

# Select a subscription 
Set-AzureRmContext -SubscriptionId $SubscriptionId

# Register for Azure Data Lake Store
Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore" 

With the Test-AzureRmDataLakeStoreItem command we can check whether a specific path, i.e. a folder or file, already exists in the account.

$FolderExists = Test-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Path $DataLakeStoreRootLocation

If the specified path does not exist, we can create it in the store with the New-AzureRmDataLakeStoreItem command.

New-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Folder $DestinationFolder

In our scenario we combine these two commands to check if the folder per website, i.e. coffee-stackexchange-com, already exists. If this is not the case, we will create it before we start uploading the *.xml-files to it.

Uploading is just as easy: call Import-AzureRmDataLakeStoreItem with the local path to the file and the destination where it should be saved in the store.

Import-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Path $FullFile -Destination $FullDestination
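
Putting these commands together, the heart of the script is essentially a loop over the extracted folders. Here is a minimal sketch of that loop - the variable and folder names are illustrative, not the exact script:

# Loop over every extracted website folder and upload its *.xml files
# (minimal sketch - variable and folder names are illustrative)
foreach ($websiteFolder in (Get-ChildItem $DumpLocation -Directory)) {
    $destinationFolder = "$DataLakeStoreRootLocation/$($websiteFolder.Name)"

    # Create the folder in the store if it doesn't exist yet
    if (-not (Test-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Path $destinationFolder)) {
        New-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Folder $destinationFolder
    }

    # Upload every XML file in the folder
    foreach ($file in Get-ChildItem $websiteFolder.FullName -Filter *.xml) {
        Import-AzureRmDataLakeStoreItem -AccountName $DataLakeStoreAccountName -Path $file.FullName -Destination "$destinationFolder/$($file.Name)"
    }
}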

That's it, that's how easy it is to interact with Azure Data Lake Store from PowerShell!

To start it, we simply call the function and pass in some metadata: which subscription we want to use, the name of the Data Lake Store account, where we want to upload the data and where our extracted data is located.

C:\Demos > Import-StackExchangeToAzureDataLakeStore -DataLakeStoreAccountName 'codito' -DataLakeStoreRootLocation '/stackexchange-august-2015' -DumpLocation 'F:\2015-August-Stackexchange\' -SubscriptionId '<sub-id>'

While it's running you should see how it is going through all the folders and uploading the files to Azure Data Lake.

Once the script is done we can browse through all our data in the Azure portal by using the Data Explorer.

Alternatively, you could also upload it to Azure Blob Storage with ImportStackExchangeToAzureBlobStorage.ps1.
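
The Blob Storage variant follows the same pattern; at its heart it uploads each file with the classic storage cmdlets. A minimal sketch (the account name, key and container are illustrative placeholders, not the exact script):

# Upload a single file to Azure Blob Storage instead of Data Lake Store
# (account name, key and container are illustrative placeholders)
$context = New-AzureStorageContext -StorageAccountName "codito" -StorageAccountKey "<storage-account-key>"
Set-AzureStorageBlobContent -File $FullFile -Container "stackexchange-august-2015" -Blob $FullDestination -Context $context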

Conclusion

We've seen how we can provision an Azure Data Lake Store and how we can use infrastructure in Azure to download, unzip and upload the StackExchange data to it. We've also had a look at how the dump is structured and what data it contains.

I've made my scripts available on GitHub so you can test them out yourself!
Don't forget to turn off your VM afterwards...

In the next blog post we will see how we can aggregate all the Users.xml data into one CSV file with Azure Data Lake Analytics by writing a single U-SQL script. This will allow us to analyze the data later on before we visualize it.

If you have any questions or suggestions, feel free to write a comment below.

Thanks for your time,

Tom.

Categories: Azure
Tags: Data Lake
written by: Tom Kerkhove

Posted on Wednesday, March 2, 2011 12:08 AM

Sam Vanhoutte by Sam Vanhoutte

Lately, a lot of CTPs have become available on the Windows Azure platform. My goal is to try them all out. This is a blog on SQL Azure Data Sync.

Lately, a lot of CTPs have become available on the Windows Azure platform.  My goal is to try them all out.  I have blogged on two of these technologies already:

This time, we're moving to the data side, with SQL Azure Data Sync.  This data synchronization service is built on the Microsoft Sync Framework technologies. It provides bi-directional data synchronization and data management capabilities, allowing data to be easily shared across SQL Azure databases within multiple data centers.

A very good and detailed overview on Data Sync can be found on the TechNet Wiki: http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-data-sync-overview.aspx

Scenarios

This SQL Azure Data Sync service is a very promising and interesting feature that opens a lot of great scenarios:

  • For development/testing purposes: provide data in your local database (on development) and synchronize with the Azure database to feed the cloud instances.
  • Synchronizing configuration or master data across databases in a very easy way.
  • Use SQL Azure Reporting Services on ‘local’ data.

Registration for CTP

Since this feature is still in CTP, you need to register first to get an invitation code.  To do this, just browse to https://datasync.azure.com and sign in with your Live ID.  After this, you can enter your e-mail and other details and you should receive an invitation code some time later.  With that registration code, you can log on and access the full feature set.

Concepts

The following concepts are important in setting up Data Sync:

  • Sync group: a sync group is a group of databases that can be synchronized together
  • Database: a database that gets registered for SQL Azure Data Sync
  • Agent: an agent is a Windows Service that performs and orchestrates the actual synchronization

Configuring data synchronization

Step 1: download the Sync Agent

A Sync agent is only required when synchronizing data from an on-premise database.  If the synchronization is set up from Cloud to Cloud, then it is not required to download the Agent, since the synchronization logic will be run in the Cloud.

For this example, I will synchronize a local database with a SQL Azure database.

On the Data Sync portal, click the Agents tab page and download the Agent installer at the bottom of the page.  After running the installer (and providing credentials for the Data Sync windows service), you can open a local configuration tool, through the Start menu: SQL Azure Data Sync Agent CTP2.

This tool allows you to configure local databases for synchronization.  The following screenshot shows the default configuration window, after installation.  To join this agent to a Windows Azure subscription, it is important to configure the Agent Key first.  This can be done by clicking the Edit Agent Key button and providing the key that can be copied from the Data Sync Portal. 

Once this is configured, it should be possible to Ping the Azure Sync Service, by clicking the Ping Sync Service button.


Step 2: Add your local database

Now we have the local agent installed, we will add a local (on premise) database to the configuration tool, so that it can be synchronized later.

In this example, I am using a custom database with 3 tables: Customer, OrderLines, OrderEvents.


  • I will add this database in the SQL Azure Data Sync Agent tool, by clicking the ‘Add Member’ button at the left.  This pops up a configuration window where the server, database and the authentication method need to be selected. (1) 
  • The Data Sync services have some limitations on supported data types, etc.  It is possible to check the database for issues, by clicking the ‘Check Member Schema’ button. (2).
  • I added a field with an unsupported data type (geography) to the customer table and the Schema validation provided a warning that indicated that this column would be skipped. (3) This is because geography is not supported in Windows Azure at this point in time.
  • It is very important to start the Windows Service of the Azure Data Sync Agent, in order to register the client database with the Windows Azure subscriptions.

Step 3: Add the SQL Azure database

Adding a SQL Azure database to the Data Sync subscription is much easier and can be done on the Data Sync management portal, by clicking the Databases tab and clicking the Add button.  In the pop-up, you just need to provide the server, database and credentials before saving.

In my example, I am just adding a new, empty database here.


Step 4: Set up the synchronization, through a Sync Group

  • In the management portal, click the New Sync Group button on the Sync Groups tab. Add all the databases you want to sync together to the database list (1) and click Next.
  • In the next screen, you can add the tables per database that you want to synchronize.  For this demo, I want to synchronize everything except the events table. (2)  You can also enable a schedule for the synchronization (expressed in minutes).
  • Once the group is created, you can synchronize the group easily.


Testing the data synchronization

Now that I have the databases configured for synchronization, it's time to play around with it a bit.  To execute the synchronization, you can either rely on the synchronization schedule, when configured on the sync group, or trigger the synchronization manually.


1st synchronization: Creating tables in the cloud.

The first synchronization I did created the two synchronized tables on my empty Cloud database and added the data there.  One thing to notice is that the Location column (with the spatial data type) was not created on the SQL Azure database. 

2nd synchronization: Adding data locally, synchronizing to the cloud.

In this test, I added two customers and some order lines to the local database and synchronized with the Cloud database, to find out that everything was copied without problems.

3rd synchronization: Adding data in the cloud, synchronizing locally.

In this test, I added a customer to the cloud database and synchronized with the on premise database, to find out that everything was copied without problems.

4th synchronization: Adding data on premise and in the cloud, synchronizing bi-directionally

In this test, I added customers in both databases before synchronization to find out that the data was synchronized correctly. 

Data conflicts

Now, I wanted to simulate some conflicts to find out how the Data Sync would handle them.

Adding or updating data in both databases, with the same primary key.

I added two different customers with the same primary key, one in each database.  Surprisingly, the synchronization happened without any issue, but my local customer was overwritten by the cloud customer, resulting in lost data.

  • Both records (local + cloud) were normally marked as new, so it should be possible to detect this.
  • I was looking to find out if the ‘most recent’ record won, but in all scenarios, the data from the cloud database survived the synchronization.  (probably, because this database was added first)

Maybe something for a next CTP?

Deleting data that is being referenced by a non-synchronized table

Another interesting test was to delete an order line on my cloud database.  But this order line was being referenced by a record in my local database (OrderEvent).  Knowing that the OrderEvent table is not being synchronized, this should result in a conflict.

Here I did not receive an exception, but I also noticed that the record in my local database still existed, whereas my cloud record was deleted.  So here my data was out of sync.

Maybe something for a next CTP?

Adding a third database

The last test I did was adding a new, empty (cloud) database to the sync group, and after the synchronization everything seemed to work automagically.  Great!

Underlying design

  • When configuring a table for synchronization, triggers are being created for the three actions: insert, update, delete.  They have the following name structure: [TableName]_dss_[action]_trigger.
  • These triggers add the data in new tables that are being created during configuration of the sync group.  For every data table, a sync table is being added with the following name: DataSync.[TableName]_dss_tracking
  • Next to that, we can also see that a lot of new stored procedures are created (a quick way to list these generated objects is sketched below)
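
As a quick way to inspect what Data Sync generated in a database, a hedged sketch with Invoke-Sqlcmd (the server and database names are placeholders; the filter is based on the naming convention above):

# List the triggers, tracking tables and stored procedures that Data Sync generated
# (server and database names are placeholders; '[_]dss[_]' escapes the underscores in the LIKE pattern)
Invoke-Sqlcmd -ServerInstance "localhost" -Database "<your-database>" -Query @"
SELECT name, type_desc
FROM sys.objects
WHERE name LIKE '%[_]dss[_]%'
ORDER BY type_desc, name
"@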

Conclusion

This CTP already looks very stable and the synchronization (between multiple databases) is very smooth.  I am just hoping that there will be a better view or configuration for synchronization exceptions (as explained in the conflicts section).

Sam Vanhoutte, Codit

Categories: Azure
written by: Sam Vanhoutte

Posted on Friday, November 20, 2015 2:35 PM

Tom Kerkhove by Tom Kerkhove

Recently I have been working on an open-source project around Azure Data Lake Analytics. It will contain custom U-SQL components that you can use in your own projects and already contains an Xml Attribute Extractor.

In this brief article I'll discuss how you can use NuGet to perform automated builds for Azure Data Lake Analytics extensibility.

To build my code I'm using MyGet.org because it builds my code and automatically packages it into one or more NuGet packages.

The only thing I needed to do was sign-up and point MyGet to my repository. Every time I push my code to GitHub it will automatically be built & re-packaged.

Unfortunately, the build service encountered some issues when it was building my code:

2015-11-06 02:14:23 [Information] Start building project D:\temp\tmp9931\src\Codit.Analytics.sln...

C:\Program Files (x86)\MSBuild\14.0\bin\Microsoft.Common.CurrentVersion.targets(1819,5): warning MSB3245: Could not resolve this reference. Could not locate the assembly "Microsoft.Analytics.Interfaces".

Check to make sure the assembly exists on disk. If this reference is required by your code, you may get compilation errors.

Obviously the server didn't have the correct DLLs to build it, but first: how did I create my project?

With the Azure Data Lake Tools for Visual Studio you create a solution/project for U-SQL based on these templates:

For my custom extractor I have created a "Class Library (For U-SQL Application)" project with a corresponding Test-project. Once it's created it automatically references the DLLs you need.

The problem with these references is that they point to DLLs in the Visual Studio folder.

C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\PublicAssemblies\Microsoft.Analytics.Interfaces.dll

As the build server doesn't have the tooling installed, it can't find them. Unfortunately there is no NuGet package for this, although this is something that is under review.

But don't worry! I've wrapped all the DLLs in two NuGet packages so we can build all our stuff until there is an official NuGet package:

  • Install-Package TomKerkhove.Analytics.Extensibility - Contains all the DLLs for writing your own extensibility (NuGet page)

  • Install-Package TomKerkhove.Analytics.Extensibility.Testing - Contains the unit test DLL and installs the above package. (NuGet page)

By installing these packages you are overriding the existing DLLs & your build server is good to go!
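
For example, from the Package Manager Console in Visual Studio (the project names below are illustrative, not the actual ones):

# Install the extensibility package into the class library project
Install-Package TomKerkhove.Analytics.Extensibility -ProjectName Codit.Analytics

# Install the testing package into the test project (it pulls in the package above as a dependency)
Install-Package TomKerkhove.Analytics.Extensibility.Testing -ProjectName Codit.Analytics.Tests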

Thanks for reading,

Tom.

Categories: Azure
Tags: ALM, Data Lake
written by: Tom Kerkhove

Posted on Friday, October 13, 2017 10:50 AM

Tom Kerkhove by Tom Kerkhove

A few weeks ago, Microsoft held another edition of its Ignite conference in Orlando, FL.

After going through most of the announcements and digesting them I found that there were a couple of interesting ones in the security & data space.

Let's have a closer look.

Introducing Virtual Network Service Endpoints (Preview)

With the introduction of Virtual Network Service Endpoints (Preview) you can now protect your Azure resources by moving them inside a VNET and thus restricting access to that VNET or subnet itself.

Currently, this is only supported for Azure Storage & Azure SQL Database/Warehouse but the end goal is to provide this for all services.

By using VNET Service Endpoints you can fully isolate your resources, because you can remove all access from the public internet, limiting the risk of exposure.

Isolated access has been a long-awaited feature, certainly for Azure Storage & Azure SQL Database, and I am excited and very happy that it's finally here!
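
As a rough sketch of what enabling a service endpoint looks like with the AzureRM PowerShell cmdlets (the resource group, VNET and subnet names are illustrative, and the exact parameters may differ while the feature is in preview):

# Enable the Microsoft.Storage service endpoint on an existing subnet
# (resource group, VNET, subnet name and address prefix are illustrative placeholders)
$vnet = Get-AzureRmVirtualNetwork -ResourceGroupName "my-resource-group" -Name "my-vnet"

Set-AzureRmVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "backend" `
    -AddressPrefix "10.0.1.0/24" -ServiceEndpoint "Microsoft.Storage"

# Persist the updated subnet configuration
Set-AzureRmVirtualNetwork -VirtualNetwork $vnet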

Additional resources:

Introducing Azure Data Factory 2.0 (Preview)

This must be my favorite announcement - Azure Data Factory 2.0 (Preview), the next generation of data integration.

While Azure Data Factory 1.0 was limited to a data-slicing model only, it now supports different types of triggers such as webhooks.

With Azure Data Factory 2.0 comes the new Integration Runtime that provides you with the infrastructure to orchestrate data movement, activity dispatching & SSIS package execution, both in Azure & on-premises.

But that's not all, there is more - Http activity support, integration with Azure Monitor, integration with Azure Key Vault, and much more! We'll dive deeper into this announcement in a later article.

Additional resources:

Azure DDOS Protection Service (Preview)

Distributed Denial-of-Service attacks can be brutal and are unfortunately very easy to carry out. Nowadays, you can find them on the internet as a managed offering or even do it yourself, just like Troy Hunt explains.

That's why Microsoft is announcing Azure DDOS Protection Service (Preview) that allows you to protect your Virtual Networks in order to secure your Azure resources even more.

However, Microsoft Azure already brings you DDOS protection out-of-the-box. The difference here is that Azure DDOS Protection Service takes this a step further and gives you more features & control.

Here is a nice comparison:

Azure DDOS Protection Service is a turn-key solution which makes it easy to use and is integrated into the Azure Portal. It gives you dedicated monitoring and allows you to define policies on your VNETs. By using machine learning it tries to create a baseline of your traffic pattern and identifies malicious traffic.

Last but not least, it also integrates with Azure Application Gateway allowing you to do L3 to L7 protection.

Additional resources:

Taking Azure Security Center to the next level

Another example of the security investment by Microsoft is their recent announcements for Azure Security Center. You can now use it not only for cloud workloads but for on-premises workloads as well.

Define corporate security standards with Azure Policy (Limited Preview)

Azure Policy allows you to define corporate standards and enforce them on your Azure resources to make sure that the resources are compliant with your standards. It also comes with some default rules, such as requiring at least SQL Server version 12.0, and policies can be scoped to either the management group or resource group level.

By using initiative definitions, you can group one or multiple policy definitions as a set of requirements. An example could be an initiative that consolidates all SQL database related definitions.

To summarize, Azure Policy allows you to define security standards across multiple subscriptions and/or resource groups, making it easier to manage your complete infrastructure.
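
To give an idea of how an assignment could look with the AzureRM cmdlets (the display name and scope below are illustrative placeholders):

# Assign an existing policy definition to a resource group scope
# (the display name and scope are illustrative placeholders)
$definition = Get-AzureRmPolicyDefinition |
    Where-Object { $_.Properties.DisplayName -eq "Require SQL Server version 12.0" }

New-AzureRmPolicyAssignment -Name "require-sql-v12" `
    -PolicyDefinition $definition `
    -Scope "/subscriptions/<sub-id>/resourceGroups/<resource-group>"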

It is currently in limited preview, but you can sign up for the preview in the Azure portal.

Introduction of Security Playbooks

With the addition of Security Playbooks you can now easily integrate certain playbooks in reaction to specific Security Center alerts.

It allows you to create & link an Azure Logic App which orchestrates the handling of the alert, tailored to your security needs.

Investigation Dashboard

Azure Security Center now provides a new visual, interactive investigation experience to analyze alerts and determine the root cause.

It visualizes all relevant information linked to a specific security incident, in this case an RDP brute force attack.

It makes it a lot easier to get the big picture of the potential cause, but also of the impact of the incident. By selecting certain nodes in the visualization, it provides you with more information about that specific segment. This enables you to drill deeper and get a better understanding of what is going on.

However, these are only a subset of the announcements; you can find all of them in this blog post.

Additional resources:

Introducing SQL Vulnerability Assessment (VA)

SQL Vulnerability Assessment (VA) is a new service that comes with Azure SQL Database and with SQL Server on-premises via SQL Server Management Studio (SSMS).

It allows you to discover, track and remediate potential database vulnerabilities. You can see it as a lite version of Azure Security Center focused on SQL databases, listing all potential vulnerabilities after running a scan.

This is another example of Microsoft making security more approachable, even if you are not a security expert. After running a scan you will probably find some quick wins, making your database more secure step by step.

Additional resources:

Summary

Microsoft made some great announcements at Ignite and this is only the beginning. There were a lot more of them, and I recommend reading more about them on the Azure blog or watching the Ignite sessions on-demand.

Personally, I recommend Mark Russinovich's interesting talk called "Inside Microsoft Azure datacenter hardware and software architecture", which walks you through how Azure datacenters work, their recent investments & achievements and what their future plans are.

Lately, the IT side of Azure has been coming closer to the developer side, with services such as Azure Networking becoming easier to integrate with PaaS services such as Azure Storage & SQL DB. It looks like this is only the beginning, and we can expect more of these kinds of integrations, making it easier for both IT & Devs to build more secure solutions.

Last but not least, don't forget that the Azure Roadmap gives a clear overview of which service is at which stage. Here you can see, for example, all services that are in preview.

Thanks for reading,

Tom Kerkhove.

Categories: Azure
written by: Tom Kerkhove

Posted on Wednesday, December 30, 2015 2:00 PM

Tom Kerkhove by Tom Kerkhove

As of Wednesday the 28th of October, Azure Data Lake Store & Analytics are in public preview, allowing you to try them out yourself. You won't have to worry about any clusters, which allows us to focus on our business logic!

To celebrate this, I'm writing a series that will take you through the process of storing the data in Data Lake Store, processing it with Data Lake Analytics and visualizing the gained knowledge in Power BI.

 

I will break-up the series into four major parts:

  1. Storing the data in Azure Data Lake Store or Azure Storage
  2. Aggregating the data with Azure Data Lake Analytics
  3. Analyzing the data with Azure Data Lake Analytics
  4. Visualizing the data with Power BI

During this series we will use open-source data from StackExchange.
This allows us to deal with real-world data and the difficulties it might bring.

In my next post I'll walk you through the steps to upload the data and how we can do this in a cost-efficient way.

Thanks for reading,

Tom Kerkhove.

Categories: Azure
written by: Tom Kerkhove