Wednesday, June 19, 2013

Using LINQ to SQL for ETL projects

I've spent part of the past 18+ months managing a team that's been building a large-scale web application from the ground up. One of our many tasks was to move existing client data from their current web application to the new one we were working on. Having spent a few years as an ETL programmer, I decided to take on the project myself. As a manager, I think it's important to stay in the code when you can, and this looked like a really good opportunity for me to do that.

Both the old and new web applications used SQL Server databases, and had somewhat similar purposes, making the migration process between the two pretty straightforward; I just needed to decide how to write it. In the past, this is something I probably would have done with some T-SQL code. But the development team was using an OR mapping tool for the new web application, and I was really curious to see if that could work for an ETL process.

Why not just do it the old fashioned way? Well, anyone who's written a T-SQL data migration knows where the pain points with that approach are: buildout time and debugging. The appeal of using OR mapping to me was that I could quickly write the process in C#, and be able to easily step through all the code to find and fix problems. I was going to be moving years of data for multiple clients; it needed to be accurate, and I needed to know when and where it wasn't, and why. To me, these two reasons made trying the LINQ to SQL route at least a worthwhile experiment, if not a potentially better approach.

Ultimately the project was a success, but I ran into a few stumbling blocks along the way, mainly with poor performance. Buildout for the utility was very fast as I had anticipated (only a few work days), but I had to go back and rework a number of the steps during the testing process as I figured out how to address various performance bottlenecks in LINQ to SQL. Below is a "project retrospective" of sorts; some tips, tricks and tools I learned along the way.

1. SQLMetal rocks

The first time I ran sqlmetal.exe was one of those epiphany moments developers sometimes experience when they discover a sweet new tool: the heavens open, a bright light envelops you, and a choir fills the room. That might be a slight exaggeration of the event, but it is a slick utility. With a single command-line entry, you can generate a .dbml file for any SQL database in seconds. Just add this file to your project, put a using; reference in your main code file, and blammo! - you can access any object in that database. Simple and fast. If you haven't used it before, I highly recommend giving it a try; here's a nice walkthrough on getting started.

2. Turn Off ObjectTrackingEnabled in your source DataContext

This is a really simple change, and should be implemented if you will not be modifying the data in your source database during the process. Just add the following line of code right after you initialize your connection to the DataContext of your source:
  • MyDataContext.ObjectTrackingEnabled = false;
Microsoft's documentation states, "Setting this property to false improves performance at retrieval time, because there are fewer items to track." Sounds good to me.

3. Use 'DELETE' in-line SQL commands instead of DeleteOnSubmit()

In my project, I needed to clear out my destination database before each run of the ETL process, and saw major performance gains by making this change. I had about 50 tables to reset, and was able to drop the runtime for this step from 45 minutes to 4.5 seconds. If you need to do any delete steps:
  • USE - MyDataContext.ExecuteCommand("DELETE FROM [Table]");
  • NOT - MyDataContext.Entity.DeleteOnSubmit(object);
Sorry all you purists out there; in-line SQL is just so much faster in this situation.

4. Use explicit selects when creating your LINQ objects

You only want to bring back fields and data from your source database that you're actually going to move. If a table has 15 fields and you're only moving 2 of them, leave the other 13 alone by using specific select statements:
  • USE - var Records = from x in DataContext.TableObject select new {x.Id, x.Value};
  • NOT - var Records = from x in DataContext.TableObject select x;

5. If you need to access a data object more than once, move it to memory first

As obvious and straightforward as that sounds, I'll admit I didn't do it in the initial version of my application. It was pointed out to me by a teammate during a code review, and to his credit, improved performance for the LINQ queries where I implemented it by 10x. I found a good number of opportunities to use this technique within loops:


foreach (var Record in SomeCollection) {
    string Code = (
        from AS in DataContext.ActivityStatus
        where AS.StatusId == Record.StatusId
        select AS.Code
} // ('Code' value will be taken from the database each time through the loop)


var ActivityCodes = (
    from AS in DataContext.ActivityStatus
    select new {AS.StatusId, AS.Code};
foreach (var Record in SomeCollection) {
    string Code = (
        from AC in ActivityCodes
        where AC.StatusId == Record.StatusId
        select AC.Code
} // ('Code' value will now be taken from memory each time through the loop)

Far fewer trips to the database yields much better performance. Duh.

6. Avoid using InsertOnSubmit() whenever possible

This function is by far, hands-down, unequivocally the biggest hindrance to moving your data in a reasonable amount of time. I learned the truth about InsertOnSubmit()the hard way in my project.

Here's an example - I had a simple table with an integer ID field and a varchar(50) string field. Oh, and it contained about 1.1 million records. Despite the size, you'd think that since the data fields are so simple, it should be pretty quick to move, right? Nope. On the first run of my application, it took a whole work week just to move this one table! That's a blazing speed of about 150 records per minute. *sigh*

So now what? Scrap the project as a learning experience and start over? Nah, I don't give up that easily. After a couple hours of research, I found the solution: SqlBulkCopy. There are a number of examples online of how to implement this class, but ultimately I decided to roll my own, since most of them had extra code I didn't need. My class is available here for download. To use it, you just need to create a List of your LINQ objects, then call BulkProcessor.Cram() when you're ready to add the contents to your database:

List<DataContext.LinqObject> ObjectsToInsert = new List<DataContext.LinqObject>();
foreach (var Record in SomeCollection)
    DataContext.LinqObject PersonObject = new LinqObject()
        Name = "A. Person",
        Address = "123 Main St.",
        City = "Anytown",
        Zip = 12345,
        Notes = Record.TextValue,
// List has been built; now use SqlBulkCopy to insert!
BulkProcessor.Cram(ObjectsToInsert, "PersonDetail", "[db_connection_string]");

Simple as that. Now that million record table takes minutes to move instead of days.


Overall, I'm quite satisfied with the result. I think building an ETL process using C# and LINQ is certainly a viable option for the right project and situation. Some final thoughts:
  • I liked how easily I could make changes to the process. When the development team made changes to the database schema, it was simple for me to update my OR map and migration code to account for it. When I found data a data issue, it was easy to alter the process to make sure it didn't happen again. Literally took a few minutes in most cases.
  • I now have what can be considered a prototype and / or reusable code for future ETL needs for the web application. For example, a new client comes on and wants data from their old system moved over. Most of the work is already done; I'd just need to change the source DataContext and a few select statements. Maybe they want a set of data sent to them from the system? Again, we now have most of the code needed to do this.
  • Just as a point of caution, the biggest database I've used this with so far is around 2GB. I'm not sure how it would perform with a dataset 5 to 10 times larger.
  • Be sure to heed my warning about InsertOnSubmit(), but be pragmatic about it. Use my result of 150 records a minute as a benchmark to determine when it's OK for you to use it instead of SqlBulkCopy.

I'd love to hear from anyone else who's attempted this with either additional tips or pitfalls to share, or with questions about my approach that I may have omitted.