If you only came here for the links:

This was a presentation made to the Austin Java User's Group in Austin, TX, on June 24th, 2014.

Why You Do Care If Hadoop Is Too Mysterious

Developers care when there is a useful technology that won't get used because it seems too hard. Especially when it isn't that hard. Awkward as hell, maybe. But hard? Maybe not.

Hey, I believed it too. Beware of experts offering to help you with Hadoop who gain more from maintaining their godlike status than from letting easy stuff look easy. It's a fine line between being helped and being misled. I've seen that line crossed in this space.

Yes, You May Want Hadoop

Hadoop deserves to be popular because it's one of the important tools in the big data space. You can do some pretty cool stuff that would be harder with other toolsets, and you don't need a zillion-dollar budget either. So don't give up before you start. It is possible.

Common use cases include spinning up a batch of servers, running a data analysis job, and spinning the servers back down like they had never existed. That's a useful thing to be able to do.
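One concrete way to get that spin-up / run / spin-down pattern, sketched with the AWS SDK for Java and EMR (my illustration only - the ops code linked below uses Chef, and the instance types, S3 paths, and release label here are placeholders):

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class TransientAnalysisCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The analysis job, packaged as a Hadoop jar sitting in S3 (placeholder paths).
        StepConfig analysis = new StepConfig()
                .withName("nightly-analysis")
                .withActionOnFailure("TERMINATE_CLUSTER")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://my-bucket/jobs/analysis.jar")
                        .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("transient-analysis-cluster")
                .withReleaseLabel("emr-5.36.0")
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withSteps(analysis)
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(5)
                        .withMasterInstanceType("m5.xlarge")
                        .withSlaveInstanceType("m5.xlarge")
                        // The servers disappear on their own once the step finishes.
                        .withKeepJobFlowAliveWhenNoSteps(false));

        System.out.println("Started cluster " + emr.runJobFlow(request).getJobFlowId());
    }
}
```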

Made Simpler?

What follows in the linked presentation is my attempt to show why Hadoop seems really hard when, in fact, it is just awkward.

Much of this is dated information that will hopefully be made obsolete by future generations of Hadoop. But if, right now in mid-2014, you are stalled on a Hadoop pilot project because you can't get your head around it, at least this will address that. It will also address the craziness that sets in when you have a lot of simple ETL jobs that are taking far longer to develop than it seems they should.

First, I show that there are only 7 distinct steps required to load and query Hadoop. See the slide deck if that's all you need.
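If you want a feel for what "load and query" boils down to before opening the deck, here is a minimal sketch of my own (not the deck's exact seven steps): push a delimited file into HDFS with the Hadoop FileSystem API, put a Hive table over it, and query it through JDBC. The table name, columns, and paths are made up for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadAndQuery {
    public static void main(String[] args) throws Exception {
        // Load: copy a local CSV into the HDFS directory the table will sit on.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path("/tmp/orders.csv"),
                             new Path("/data/raw/orders/orders.csv"));

        // Query: declare an external Hive table over that directory, then ask it questions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS orders ("
                + "order_id BIGINT, customer STRING, total DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                + "LOCATION '/data/raw/orders'");

        ResultSet rs = stmt.executeQuery(
                "SELECT customer, SUM(total) FROM orders GROUP BY customer");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
        }
        conn.close();
    }
}
```

Awkward? A little. Hard? No.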

Even if you decide to stop there, you've got something. Not enough, but something.

Automation of ETL? That Isn't Hard, Either.

This is also pretty important: automating Hadoop ETL (Extract, Transform, Load). Why? Because automating the ETL process removes one "awkward as hell" part of Hadoop, making the overall process much less awkward as a result.

Once you automate the awkward part, you may discover that it is quite feasible to do some pilot projects, spin up some data sets and do some legitimate analysis.

The code for automating an ETL is all here; use it as you will. Alternatively, just use it as inspiration for your own templates, which you would write yourself. Simple, dumb stuff.
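To make "templates" less abstract, here is a minimal sketch of the idea (my illustration, not the linked project's code): describe a feed as a table name, an HDFS directory, and a column list, and stamp out the Hive DDL from that instead of hand-writing it per feed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HiveDdlTemplate {

    /** Stamps out the CREATE TABLE statement for one comma-delimited feed. */
    static String createTableSql(String table, String hdfsDir, Map<String, String> columns) {
        StringBuilder cols = new StringBuilder();
        for (Map.Entry<String, String> c : columns.entrySet()) {
            if (cols.length() > 0) cols.append(", ");
            cols.append(c.getKey()).append(' ').append(c.getValue());
        }
        // External table: "loading" is just landing files in the LOCATION directory.
        return "CREATE EXTERNAL TABLE IF NOT EXISTS " + table + " (" + cols + ")\n"
             + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
             + "LOCATION '" + hdfsDir + "';";
    }

    public static void main(String[] args) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("order_id", "BIGINT");
        columns.put("customer", "STRING");
        columns.put("total", "DOUBLE");
        System.out.println(createTableSql("orders", "/data/raw/orders", columns));
    }
}
```

Once the per-feed boilerplate comes out of a template like this, the "awkward as hell" part mostly disappears.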

How Much Time Would I Have To Invest?

Scan over the slides first; then, if you think there is enough there, invest an hour or so in the presentation. By then you'll have a better idea of whether further work inside Hadoop is worth your effort.

Automation: To Generate? Or Improve the API?

Automation is shown in this presentation as generated code. That choice is purely a concession to the current API.

As I learned from Michael Nash 14 years ago, generated code is almost always inferior to a less leaky API. Someday, a better abstraction layer may make the auto-generated code you see here obsolete, and in some respects that would be an improvement. But that is a topic for a different blog post.

Extreme Mathematics

Once you've got your massive data sets inside Hadoop, that's when you might need to engage a data scientist and/or a programmer more familiar with data analysis techniques such as clustering. But that is not so much Hadoop as it is data science and data analysis. Check out your local meetups, and also projects such as Mahout. This is where the bigger IT budgets can come in handy!

Code Links:

The code projects referenced in the YouTube video and slides above are also linked below:

Dev Code

Ops Code - Chef cookbooks/recipes