Three Ways I Got You To Read This Stupid Post

June 26, 2014 / b3llm4r

A month or two ago I started experimenting with Outbrain, a little startup that promises to put links to your blog posts on top sites. It’s an interesting premise and they had offered me $50 to play with so… hey why not? Let’s share the Exversion love.

A few weeks into it I noticed an awful lot of activity around this blog, all coming from a quick throwaway post I put up a few months ago: Five Things They Don’t Tell You About StartupBus. Outbrain had indexed all our content, but their algorithms really only liked that post.

It was easy to guess why: Outbrain seems to match relevant content by keywords in the post titles, rather than post content. So while Differential Diagnosis in Production might have ten times the technical content of Five Things They Don’t Tell You About StartupBus, and Version Control for Data might be loaded with relevant material, neither one has a very clickbait title.

The suggestion that Upworthy style “You won’t believe what happens next!” titles are more effective even on technical blogs made me a bit sad, but also piqued my curiosity. Was it the keyword “startup” in the title, or was it the “(Num) Things (blah blah blah)” format that made Five Things They Don’t Tell You About StartupBus so successful?

So I designed a quick experiment. I created two clones of this blog: We Make Data Sexy and Free the Data. Exactly the same content, just different titles. Some titles had stupid “(Num) Things (blah blah blah)” format, while some were just structured around relevant keywords (hacker, apps, startups, big data).

Then I created campaigns for both clones and let all three blogs compete to see which posts did better.

Here are the most popular posts on this blog before Outbrain:

Version Control for Data
Differential Diagnosis in Production
Five Things They Don’t Tell You About StartupBus
The Ethical Hackathon or: How we learned to put on a good hackathon and make hackers happy
Every piece of NYC’s real estate data is now accessible through our API
And we’re off, Exversion is now available to everyone.
Dealing with Complex Data on Exversion

Here are the titles that performed the best on Outbrain:

2 Reasons Why You Need to Version Control Your Data
The Best Startups Are Bad At Blogging
Pushing a Microsoft API to the Limits
Why Developers Won’t Build Cool Apps
Getting Out Of Big Data’s Torture Chamber
How to Make Hackers Love Sponsors
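
For anyone who wants to tally a list like this themselves, here is a minimal Python sketch. It has nothing to do with how Outbrain actually scores titles (that is unknown to me); it just checks each title for a leading number and for the four keywords used in the experiment. The names `KEYWORDS`, `NUMBERED`, `classify` and `tally` are made up for this illustration, and the number-word list is an arbitrary assumption.

```python
import re

# Keywords the experiment used for the keyword-style titles.
KEYWORDS = ("hacker", "apps", "startups", "big data")

# Assumption: a "(Num) Things" title opens with a digit or a small number word.
NUMBERED = re.compile(
    r"^\s*(\d+|two|three|four|five|six|seven|eight|nine|ten)\b", re.IGNORECASE
)

TOP_OUTBRAIN_TITLES = [
    "2 Reasons Why You Need to Version Control Your Data",
    "The Best Startups Are Bad At Blogging",
    "Pushing a Microsoft API to the Limits",
    "Why Developers Won’t Build Cool Apps",
    "Getting Out Of Big Data’s Torture Chamber",
    "How to Make Hackers Love Sponsors",
]


def classify(title):
    """Return (is_numbered, has_keyword) for one post title."""
    lowered = title.lower()
    is_numbered = bool(NUMBERED.match(title))
    has_keyword = any(keyword in lowered for keyword in KEYWORDS)
    return is_numbered, has_keyword


def tally(titles):
    """Count titles using each hook, plus titles using neither."""
    counts = {"numbered": 0, "keyword": 0, "neither": 0}
    for title in titles:
        is_numbered, has_keyword = classify(title)
        counts["numbered"] += is_numbered
        counts["keyword"] += has_keyword
        if not (is_numbered or has_keyword):
            counts["neither"] += 1
    return counts


print(tally(TOP_OUTBRAIN_TITLES))
```

With plain substring matching, "Pushing a Microsoft API to the Limits" lands in the "neither" bucket, so a crude check like this is only a starting point for reading the results.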

Okay… faith in humanity somewhat restored. Only one “put a number in it for no reason” title got any traction, but keywords definitely improved post performance. That got me thinking about how I come up with the titles of my posts.

I write for hackers, so in general I’m inclined to avoid keywords like “big data”, “apps”, and “startups” because when you put a title like that on something like Hacker News it drowns in the endless stream of identically titled company blog posts. Such keywords usually make the post seem too generic to have any interesting technical content or discussion. To capture more of an audience, I naturally lean towards titles that will stand out by being either super specific or playing on smart references.

The Outbrain experiment made me realize that even within an educated niche there’s a difference between speaking to insiders and going more mainstream. We were speaking to insiders, and our posts were very successful in that arena– generating lots of traffic organically from shares. But to a broader audience they landed with a thud. We were having trouble breaking out of that group of insiders, because we hadn’t found a balance between clickbait and accurate summary.

I don’t think we’ll go any further with Outbrain as– at the end of the day– the traffic they brought us didn’t really benefit Exversion, certainly not the way successful posts on HN have. But I’m definitely going to try to slip in a few more generic keywords into our titles from now on 😉

Obscure Data Formats: .px files

June 17, 2014 / b3llm4r

I was answering some Data Requests this morning when I came across a download option on an open data site I was not familiar with:

Download PC-Axis file

PC-Axis…? What the hell is a PC-Axis file?

As it turns out PC-Axis is a statistical program developed by the Swedish government. It is used in Croatia, Denmark, Estonia, Finland, Ireland, Iceland, Latvia, Lithuania, Norway, Greenland, the Republic of Slovakia, Slovenia, Spain, Sweden, Taiwan, the Philippines, Kuwait, Algeria, Mozambique, Namibia, Uganda, South Africa, Tanzania, Bolivia and Brazil, and by organizations like the UN’s Economic Commission for Europe, the East African Community, and the FAO (Kyrgyzstan, Ghana, Kenya).

That’s surprisingly prevalent for a small freeware program only available to Windows users.

And since the data I wanted was in this strange .px format and there were no other options, I had to open it up and figure it out.

PX files are basically CSV files with a space as a delimiter and a whole bunch of metadata at the top.

Sweet! Fire up the python scripts~

However the data was all in German and the spec on PX format left much to be desired. So for those who encounter it later on, here’s what you need to know:

All the data is– conveniently enough– in the section marked as DATA
The header section is labeled HEADING but it may only include a variable reference, which will be defined in detail in one of the VALUE sections of the metadata
The section named STUB is actually the equivalent of a y-axis. So you can think of it as extra columns with categories the user might want to filter by. The data I was looking at was broken up by year and by month, so the STUB values were year and month. Like the HEADING the details are defined in a VALUE section of the metadata

Converting PX to CSV is actually pretty simple: replace the spaces with commas, copy and paste the HEADING to the top of the file and delete the metadata. If you want to keep the STUB columns, a quick Python script that reads the CSV as a dictionary, adds the necessary values and writes another CSV file will do the trick.
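
Here is a rough Python sketch of that conversion, STUB columns included. It is not the script I actually used, and it assumes a heavily simplified file: every metadata statement is `KEYWORD=...;`, the category labels live under `VALUES("variable")` keywords, there are no semicolons inside quoted strings, and the data section has one row per STUB combination with the last STUB variable changing fastest. `SAMPLE_PX`, `parse_px` and `px_to_csv` are invented for the example, so check the assumptions against your own file before trusting the output.

```python
import csv
import io
import itertools
import re

# Hypothetical, heavily simplified PX content (real files carry far more metadata).
SAMPLE_PX = '''STUB="Jahr","Monat";
HEADING="Region";
VALUES("Jahr")="2012","2013";
VALUES("Monat")="Januar","Februar";
VALUES("Region")="Nord","Sued";
DATA=
1 2
3 4
5 6
7 8;
'''


def parse_px(text):
    """Split simplified PX text into a metadata dict and a list of data rows."""
    parts = re.split(r"^DATA=", text, maxsplit=1, flags=re.MULTILINE)
    if len(parts) != 2:
        raise ValueError("no DATA section found")
    meta_text, data_text = parts
    meta = {}
    for statement in meta_text.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        key, _, value = statement.partition("=")
        # Keep only the quoted strings, e.g. "2012","2013" -> ["2012", "2013"].
        meta[key.strip()] = re.findall(r'"([^"]*)"', value)
    rows = [
        line.split()
        for line in data_text.replace(";", "").splitlines()
        if line.strip()
    ]
    return meta, rows


def px_to_csv(text):
    """Return CSV text with STUB label columns followed by HEADING columns."""
    meta, rows = parse_px(text)
    stub_vars = meta.get("STUB", [])
    heading_vars = meta.get("HEADING", [])
    stub_values = [meta[f'VALUES("{name}")'] for name in stub_vars]
    heading_values = [meta[f'VALUES("{name}")'] for name in heading_vars]
    # One output column per combination of HEADING categories.
    columns = [" / ".join(combo) for combo in itertools.product(*heading_values)]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(stub_vars + columns)
    # One data row per combination of STUB categories, last variable fastest.
    for labels, row in zip(itertools.product(*stub_values), rows):
        writer.writerow(list(labels) + row)
    return out.getvalue()


print(px_to_csv(SAMPLE_PX))
```

Going through the `csv` module instead of blindly swapping spaces for commas means labels that contain commas or spaces get quoted properly.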

Otherwise, R can read px files, and so can Matlab with some hacking. Fans of OpenRefine will be happy to know there’s an extension, and Node.js hackers can parse it with this package.

Data is The Colonial State of Tech

June 13, 2014 / b3llm4r

When we started Exversion we wrote a list of all the problems we had dealt with working with data. Top of the list was actually finding the data we needed for a given project.

Like most people, when we need something online our first stop is Google, but Google’s algorithms are designed to deliver content. Their spam prevention techniques often penalize data repositories because their listings don’t contain enough text to be seen as valuable and worthwhile content.

This interpretation of value as a matter of prose also penalizes stock image sites and file repositories. For that sort of stuff you have to know where to go before you can even touch a search box.

Stupid Ideas We Had: Let’s Build a Search Engine
We knew if we could find a way to solve this problem for others we would have one foot in the door with the rest of what we were building.

When you’re working on a startup there are certain obvious pitfalls that everyone knows they should avoid and yet nevertheless almost everyone falls into. It was important that whatever we built complemented our product vision. We didn’t want our efforts at improving accessibility taking away from what we were passionate about building (duh), but our first approach here was stupid because we underestimated how much work it is to build a real search engine.

That seems even dumber in retrospect, but we truly believed that with all the fancy new open source tools (like ElasticSearch) search would be simple.

It wasn’t simple and after a month or so I scrapped the project to go back to what we were passionate about.

Total Data Request Live
Two months ago I was sitting in a client meeting listening to a consultant go on and on about how people try to search for data, when it hit me. What if instead of a search engine, we built more of a StackOverflow system where people could request the data they were looking for and have it fulfilled by the community?

We had been working on Exversion long enough by that point to realize that a major problem for us was how heavily fragmented the data community really is. To be honest, there is no such thing as “the data community”. It is the colonial state of tech: several tribes with no common language, process, or experience roughly fenced in by an arbitrary border.

Building tools for people who work with data is different from building tools for people who write code. There are slight differences in culture across generations, races and genders in the coding community, but nowhere near the variance of the so-called “data community”.

Rather early on an advisor looked at this problem and aptly summed it up: “It doesn’t matter how great your technology is, you’re going to have to figure out a way to pull the community together first.”

Easier said than done. The typical advice for building community comes down to a handful of limp “cross-your-fingers” solutions: comments, gamification, social sharing. We needed something better.

This week we launched the first step in that direction by adding Data Requests. It’s a simple system: you post what data you need, what sources you trust, and what you want to use the data for. People can comment on a request, push the request up by saying they need the same data, or submit an Exversion repository for review. Then the requester can select the best answer from the list and close the request.
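
To make that flow concrete, here is a small Python sketch of a request's lifecycle. It is not Exversion's actual code or schema; the class name `DataRequest` and every field and method on it are hypothetical, chosen only to mirror the steps described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataRequest:
    """Hypothetical model of one request and the actions people can take on it."""

    data_needed: str
    trusted_sources: list
    intended_use: str
    comments: list = field(default_factory=list)
    supporters: set = field(default_factory=set)     # users who need the same data
    submissions: list = field(default_factory=list)  # repositories offered for review
    accepted: Optional[str] = None                   # set once the requester picks one

    @property
    def is_closed(self):
        return self.accepted is not None

    def add_comment(self, user, text):
        self.comments.append((user, text))

    def support(self, user):
        """Push the request up; each user counts once."""
        self.supporters.add(user)

    def submit_repository(self, repository):
        if self.is_closed:
            raise ValueError("request is already closed")
        self.submissions.append(repository)

    def accept(self, repository):
        """The requester picks the best submission, which closes the request."""
        if repository not in self.submissions:
            raise ValueError("repository was never submitted")
        self.accepted = repository


request = DataRequest(
    data_needed="Monthly rainfall by region",
    trusted_sources=["national statistics office"],
    intended_use="a class project",
)
request.support("alice")
request.support("alice")  # a repeat from the same user changes nothing
request.submit_repository("example/rainfall")
request.accept("example/rainfall")
print(len(request.supporters), request.is_closed)
```

Storing supporters in a set is one simple way to stop the same person from pushing a request up twice.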

We Make Data Sexy
When I opened up the feature to my friends a few days ago I ended up with a couple of emails from people who do not identify as part of the “data community” or even play a technical role. They had long wish lists of data they were trying to find for various reasons and were excited by the idea that maybe there might be a place they could go to just to get some guidance on where to even START looking.

So we’re excited about launching this officially and working on promoting it as much as possible. While the differences in experience, perspective, process, and technical literacy that keep the data community from actually being a community may be a disadvantage for us, it’s also an incredible opportunity. What kind of innovation might happen if all these different parts actually worked together?

Pushing a Microsoft API to the Limits

June 9, 2014 / b3llm4r

I’ve wanted to build a plugin for Microsoft Excel for a really long time, but not being a .NET developer I assumed this was something we’d have to outsource at a later date.

Then a few months ago I discovered two things: 1) Microsoft Office 2013 has a new “app” system for HTML5 apps and 2) You can build apps in a complete online development environment called Napa.

Not having to download any software and being able to write code in HTML/javascript made the concept of Excel integration something I could prototype myself, which meant we could try it out right away without spending any money up front and invest in backwards compatibility if things worked out.

This was going to be easy, I thought, and fun!

I was only half right.

The Agony and Ecstasy of Microsoft
Like most developers of my generation, I started out on Windows. For years I knew every trick to hack Microsoft technology, pushing machines that were 10~15 years old to their limits, cobbling together custom machines with spare parts pulled from dead units … that was the way you did it back then. Linux was barely an operating system and Mac restricted your software options too much. We were all on Windows.

What’s weird about Microsoft as a product line, if not a company, is that they don’t generally lose because their competitors beat them. They lose because they sabotage themselves. What finally got me off Windows was an external wifi card. The wifi card was specifically designed for Windows XP machines. I plugged it in to a laptop specifically designed for Windows XP with a valid version of Windows XP freshly installed … and the damn thing didn’t work. No matter what I tried the machine and wifi card would not play nicely together.

But when I threw out Windows altogether, installed Red Hat and jury-rigged a Linux driver from another manufacturer everything worked perfectly. I had it set up in under an hour and the configuration worked up until the day the machine died.

That’s when I learned an important life lesson:

Microsoft + Microsoft + Microsoft == FAIL

We revisited this lesson as a company a few months ago when we attempted to move Exversion to Azure servers. I was actually really excited about this move. I liked what Microsoft Evangelists had to say about Azure. I thought perhaps Microsoft had been humbled enough to learn from its mistakes. I was eager to give them a second chance.

And then we actually tried to do it and the move was an unmitigated disaster. Azure’s system is pretty robust and visually quite appealing, but its documentation is spread out over multiple sites, blogs, and mailing lists. Deprecated information is almost never properly identified. Basic configuration changes often required destroying the server and starting over. After hours of research I still do not understand how to set up backups. Too many tutorials are written with the assumption that you are on a Windows machine, interfacing with your Windows server through Microsoft PowerShell.

While Microsoft Evangelists were quick to celebrate Linux servers as a general idea, essential documentation was always written as if Ubuntu was a fringe product only a small group of extremists would prefer over the Microsoft alternative.

It was incredibly annoying and resulted in the loss of valuable time and money. In fact I would go as far as to say it nearly killed the company.

The Developer Friendly Microsoft?
But the opportunity to finally crack open the Excel market was just too good to pass up. I was ready to give Microsoft another chance and try to build something really powerful and awesome.

And for the most part I think it paid off. I’m pretty happy with the alpha of Exversion for Excel. Even if the development process was slowed down a little by typical Microsoft BS, prototyping was still pretty speedy because this is HTML5 and javascript.

Here are some things to consider before you get started:

The javascript API is young and still a bit immature in the options it gives you.
While documentation is still on multiple sites for multiple versions with very little indication what has survived previous incarnations and what has been scrapped completely, the information on the main site is generally sufficient to answer most questions. It’s safe to say that if it’s not on that site, the information you are reading is wrong, out-of-date or not relevant. (I expect that as the API matures Microsoft’s apparent inability to keep all their documentation in one authoritative place will be more of a problem)
You DO NOT need to download Visual Studio or any other Microsoft product in order to develop.
Microsoft’s Developer site for Office 365 is a labyrinth worthy of a David Bowie cameo. It took me forever to figure out that in order to continue working on my new app the day after I created it I had to go Login > Admin > Office 365 > Build Apps and then choose Napa Office 365 Development Tools OVER the link to my app that appeared right above it on the same list of links.

Building: What are Microsoft’s Actual Goals?
The main problem I had with using Microsoft’s javascript API was that the API did not seem to be designed for ambitious projects. It’s one thing for the tutorials to focus on small simple example apps (the Hello Worlds and whatnot), but I’m not sure how Microsoft expects developers to build anything more complicated than a few flashy alternatives to Google charts. Maybe that’s all they really want out of this.

Here are some of the issues I had to work around in building Exversion for Excel:

Grabbing data from a selection does not return any information about WHERE in the workbook the selection is. You can get the number of rows and the number of columns, but not the sheet info or the exact location of the rows and columns within the sheet.
Listeners on bound areas do not persist, which means running the entire app on one HTML page or iterating through all the bindings and resetting their listeners every time a user moves from one section of the app to the next. While this is almost certainly the nature of javascript, I would have expected Microsoft’s default library to manage this concern for me.
There is no way to manipulate the size of a selection programmatically. Exversion for Excel loads data in and out of Excel through Exversion’s API. However the method to read/write data in a spreadsheet is ridiculously anal. It needs the selected area to match the rows and columns of the data EXACTLY. Fine when we’re reading– the user expects to have to highlight all the data they want– but a real royal pain in the ass when writing data to the sheet. The application now has to instruct the user to highlight EXACTLY five columns and three thousand rows … yeah right. Even if the entire sheet is empty the API will not insert unless the selection is exactly the size of the data to be written. A couple of methods to adjust the size of the selected area programmatically would open up a whole WORLD of options to the developer. Sadly they do not exist.
Adding new data requires the user to understand what’s going on under the hood. If you’re going to add data to a section, where are you most likely to add it? To the next available empty row … which is outside of the bound area … ergo the listener will not realize that the data in the bound area has changed and fire. Since we cannot change the selected area programmatically we cannot add a few extra empty rows to our bound area to control for this possibility, and even if we could the API would complain about the selected area being TOO BIG for all our data. Since the API does not return any information about the location of the selection we can’t even give the user the option to highlight the new data and add it in to the bound section either.
When a listener fires it returns not what has been changed, but all the data in the selection. This makes sense for a graphing app, but not much sense beyond that use case. It means in order to find the change the app has to iterate through the data. The good news is that the listener fires every single time a change is made and the changes that can be made on multiple rows at once are pretty limited in Excel 2013. So once you’ve found the difference you can stop. The bad news is I can’t imagine this is going to scale well to thousands of rows.
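
Two of the chores in that list are easy to picture in code: working out exactly which range the user has to highlight for a given block of data, and hunting for the one changed cell when a listener hands back the whole selection. The sketch below is plain Python, not the Office javascript API and not Exversion for Excel's actual code; `column_letters`, `range_to_highlight` and `first_change` are hypothetical helpers. The starting cell is an assumption the caller has to supply, since the API won't tell you where a selection sits.

```python
def column_letters(index):
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def range_to_highlight(rows, start_row=1, start_column=1):
    """Return the A1-style range that exactly fits a rectangular table."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    if height == 0 or width == 0:
        raise ValueError("nothing to write")
    end_row = start_row + height - 1
    end_column = start_column + width - 1
    return (
        f"{column_letters(start_column)}{start_row}:"
        f"{column_letters(end_column)}{end_row}"
    )


def first_change(previous, current):
    """Compare two same-shaped snapshots of the bound area.

    Returns (row, column, old, new) with 0-based indexes for the first
    differing cell, or None when the snapshots match. Stops at the first
    difference, so the worst case still scans every cell.
    """
    for r, (old_row, new_row) in enumerate(zip(previous, current)):
        for c, (old, new) in enumerate(zip(old_row, new_row)):
            if old != new:
                return r, c, old, new
    return None


# Five columns by three thousand rows, assumed to start at A1.
table = [[0] * 5 for _ in range(3000)]
print(range_to_highlight(table))

before = [[1, 2], [3, 4]]
after = [[1, 2], [3, 5]]
print(first_change(before, after))
```

The snapshot comparison means keeping a copy of the bound data between events, which is part of why I doubt this approach scales to thousands of rows.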

Microsoft Being Microsoft
I managed to find user-friendly workarounds for most of those issues. I finished debugging and testing Exversion for Excel. I was satisfied that we were ready for an alpha launch, but I wanted to run the app on the real deal first. See how it handled, push it to insane limits and begin to figure out what to work on for future iterations.

That’s when Microsoft went back to being Microsoft.

See, one of the nice things about Microsoft’s Developer’s Accounts is that you get copies of the latest version of Office for free. But when I downloaded Excel for Mac and booted it up I couldn’t figure out how to install an app. I looked under all the menus, digging around in plugins and add-ons… nothing about apps.

Then I went to Google and found this.

For starters, Office 2013 doesn’t mean anything to the Mac: it’s for Windows computers only. A subscription to Office 365 ($10 per month, or $100 per year) gives you the right to download Office software to up to five computers. For Mac users, what you’ll download is Office for Mac 2011—it’s pretty much the same version of the suite that we’ve been using for a couple of years now, but it’s been updated to include activation for Office 365 Home Premium.

So BAM! In the blink of an eye the available user base for my brand new app is slashed. I knew the number of people using Office 2013 versus some earlier version would be small, but I figured that would make it the ideal testing ground. Now we’re looking for a tiny number of people who 1) use Excel for most of their data work AND 2) are on Windows AND 3) have Excel 2013 or prefer Office 365 to Google Docs.

Dear Microsoft: If you are wondering why there are barely over 100 apps built for Office 2013 this might be why. Go to some non-Microsoft developer events, a hackathon maybe, sit in on a few computer science classes. It is a SEA of MACS. You have the right idea with supporting HTML5, but the developers you’re looking to bring back to the Microsoft fold are all on Linux or Mac. The main reason why developers work on independent projects like apps is that they want to build something they themselves will use.

Thank God for Virtual Machines.


Happy Endpoints

A data blog by the Exversion team
