Be As Evil As Possible: How We Got Our Competitors to Promote Us On Twitter Without Knowing It

I often joke that if Google’s motto is “Don’t Be Evil”, Exversion’s motto is “Be As Evil As Humanly Possible”. We are– to be frank– sneaky bitches who enjoy being sneaky bitches.

When your company is tiny, bootstrapped, and marked for death by just about everyone who matters in startup land, being evil becomes a righteous pursuit, a sort of noble savagery that inevitably charms people who remember what it’s like to fight for the survival of your dream.

For me the distinction between (forgive the term) endearing evil and the sort of startup evil that creates backlash and bad PR is power dynamics. Are you being evil to protect your power base? To hoard more privilege than you need at the expense of others? Or are you being evil because you’re on the losing side of an inherently unfair game? Are you struggling against silly rules and resource-controlling systems that people with the right friends or the right names on their degrees get to bypass?

Is it a shock to anyone that this is not a meritocracy? No? Have we finally dispelled this notion? Good.

Sure, if I’m any good at my job there will come a time when Exversion will be big enough and strong enough to get its way, and then it will fall on me to channel my natural sneakiness into other, less malevolent, pursuits … like … I don’t know, creating elaborate honeypots to annoy Chinese hackers maybe.

Thankfully today is not that day, so I can spend my time passively hijacking my competitors’ Twitter accounts to promote Exversion.

First let me clarify exactly what I mean by “hijack”: No accounts were hacked, no passwords compromised, and no spam was sent to anyone via direct message, tweet, or any other channel. The victims of my shameless evil had no disruptions in their normal Twitter experience. The security and stability of Twitter as a service was not violated.

Here’s how we did it.

Like All Things Evil, It Started On Facebook

Someone, somewhere, at some point shared an article from DailyTekk and it landed on my timeline. I had never read DailyTekk before and soon found myself pulled into a vortex of links. What are the 10 best box-of-the-month subscription services I’ve never heard of? Top seven apps for organizing my life? Go on…

Page after page of articles listing apps I had never heard of but immediately wanted to try. I just kept browsing, going deeper and deeper into the site until I had ten tabs open with different DailyTekk articles.

At some point I came across fledgling Twitter engagement service Flounder. This is when the fun started.

Flounder is a neat little idea: it tracks the activity of your employees’ (or team members’) personal Twitter accounts and looks for interactions. If a conversation between a team member and another Twitter user goes longer than three interactions, Flounder follows the user your team member is talking to from your startup’s account.

So in other words, chat with @bellmar a bit and @exversiondata will automatically follow you. If you don’t follow back, after a couple of days @exversiondata unfollows you and moves on.
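
I obviously have no idea what Flounder’s code looks like under the hood, but the rule itself is simple enough to sketch in Python. Everything below is made up for illustration, assuming the three-interaction threshold described above and that “a couple of days” means two:

    from datetime import datetime, timedelta

    # Hypothetical sketch of the follow/unfollow rule described above, not Flounder's actual code.
    MIN_INTERACTIONS = 3                 # follow once a conversation runs longer than this
    UNFOLLOW_AFTER = timedelta(days=2)   # drop anyone who hasn't followed back by then

    def pick_follows(conversations, already_following):
        """conversations maps (team_member, other_user) -> number of back-and-forth replies."""
        return {user for (member, user), count in conversations.items()
                if count > MIN_INTERACTIONS and user not in already_following}

    def pick_unfollows(followed_at, followed_back, now=None):
        """followed_at maps user -> datetime the company account followed them."""
        now = now or datetime.utcnow()
        return {user for user, when in followed_at.items()
                if user not in followed_back and now - when > UNFOLLOW_AFTER}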

Well, what could it hurt? I figure … I’m not especially active on Twitter anyway, let’s give it a try and see what happens.

While I was setting up an account for @exversiondata I realized that Flounder asks you for the Twitter accounts of your team members and never… actually … you know CONFIRMS that they work for you in any way. No authentication. No message that you’ve been added to a group. Nothing. So I added Jacek to my group. True, he left the company months ago… but since he still lists Exversion on his Twitter bio I figured he was fair game.

Then I got to thinking: there was nothing stopping me from putting my competitors on my Flounder team. That way I would be automatically reaching out to people passionate about data without having to do any work actually identifying them.

Oh but then I thought… wait. Why stop there? I can take this to a whole ‘nother level.

The Follow Back Is Dead, Long Live the Follow Back

When Twitter was new, following people was an excellent way to get them to follow back. Not so much anymore. But then the follow back isn’t the real goal here. Getting more followers would be nice, but what we really want is to expose new people to Exversion. Get them curious, so that they click on the link in our Twitter profile and maybe sign up for an account. Alternatively they might click through to our timeline and read a few posts on this blog; that would be good too.

The following acts as a first contact point. The user will get an email letting them know we exist and telling them a little bit about us. For some that will be enough to pique their curiosity … I read somewhere that it takes at least three touches to trigger a conversion. Then I read somewhere else it takes 8 to 10 touches. So whatever, let the bs marketing people split hairs over this; the point is one contact is not going to be enough. We need to follow up with these new friends.

Luckily, there’s a really easy way to do this: favorite their tweets. According to Adorer 30% of people follow your account after you favorite one of their tweets. This was actually not the first time I had heard something like this.

So I wrote a quick script that picks the most recent accounts @exversiondata is following, pulls their tweets, looks for the word “data” and favorites one tweet.
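
The script itself is nothing fancy. A rough sketch of what it might look like with Tweepy (the keys and handle are placeholders, and the actual script may have used slightly different calls):

    import tweepy

    # Rough sketch of the favoriting bot described above; credentials are placeholders.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    # friends_ids returns the accounts we follow, most recently followed first
    for user_id in api.friends_ids(screen_name="exversiondata")[:20]:
        for tweet in api.user_timeline(user_id=user_id, count=50):
            if "data" in tweet.text.lower():
                api.create_favorite(tweet.id)  # favorite one matching tweet, then move on
                break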

Then I registered a cron job to have this script run every day at 10 am.
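
If you’ve never set one up, the crontab entry is a one-liner (the path here is just an example):

    # run the favoriting script every day at 10 am
    0 10 * * * /usr/bin/python /home/exversion/favorite_data_tweets.py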

As Flounder finds new people to follow, every day for two or three days those users get a notification that we have favorited one of their amazing, brilliantly written tweets. They don’t get spammed incessantly with unwanted contact. They just get pinged, casually and unaggressively a few times.

The Results

The first day using Flounder I noticed that @exversiondata had followed my friend Jay, who I often converse with on Twitter. So the damn thing works, awesome.

The second thing I noticed is that @exversiondata is suddenly following the Twitter accounts of the guy who made Flounder and Flounder’s parent company– SNEAKY BITCHES!! Although I suppose given what I’m up to I don’t have the right to complain. LOL. Okay whatever, they can have a follow back if they want. I don’t mind.

By the end of the day I not only had a couple of follow backs, but a query from someone who wanted to use our service for a big scary data project, and a handful of new followers not at all connected to this scheme (thanks, Twitter algorithms!). The rate of follow backs stayed at 10% to 15% throughout the whole experiment.

This was a big surprise. If you had asked me before this experiment started, I would have said simply following someone was not enough to entice them to follow you back. People are too cynical, too bombarded with corporate messaging; there are too many spam bots on Twitter. It will never work.

But it does work. Maybe not as well as it did when Twitter was new and people were more naive, but if you’re targeting the individual users most likely to be interested in what you’re doing, simply following people still has an impact.

And that’s not even the best part of this experiment. The most dramatic change was in traffic from Twitter, which more than tripled in a week. This is the data on referrals from Twitter, starting a month ago … can you tell when this experiment started?

[Screenshot: referral traffic from Twitter over the preceding month]

Even more impressive is how much longer the average visitor stayed on the site during this experiment:

[Screenshot: average visit duration for Twitter referrals during the experiment]

Slow Burn Hacking

The growth was gradual, a few people pinged every day, but effective. After all, too much and we would have risked attracting negative attention from Twitter itself, which sort of defeats the point doesn’t it?

As much as we all wish for the excitement of instant explosions of growth, the slow burn is harder to track and more closely resembles legitimate user behavior. Because the bot doing the favoriting was only drawing from a pool of users identified through the networks of data companies, it was unlikely to favorite something unsuitable by accident. In general it would only favorite three or four tweets a day and follow about 15 to 20 people a day, hardly the kind of automated account management that tends to piss off Twitter. (Of course, depending on how well this post does on HackerNews, we could get suspended tomorrow!)

Other than occasionally picking up the odd random account (DominosCareers? Really?) the system was really quite good at identifying people we would otherwise want to follow anyway and helped us participate in some really interesting data conversations.

Another fun, incredibly evil, side effect of this kind of hack is that the people who are most likely to tweet a company directly tend to be customers with problems. So you are pinging people right when they are most likely to consider alternatives to whatever they are currently using.

On the other hand, it’s hard to tell how effective this strategy will be outside the data community. Despite the great fanfare over “Big Data”, there are still plenty of underserved areas: only a handful of startups are working on open data, version control, or normalizing and cleaning data, and only a few more than that on the collection of data itself. In markets where another startup launches every five minutes, the same tactic may not have the same effect.

Still, what warms the evil core of my heart is the fact that the better your competition’s social media strategy is, the more effective they are at reaching out and developing conversations with potential customers, the easier it is for you to identify these people. Their outreach delivers the best potential evangelists for your project straight to your door, gift wrapped.

I Submit to Our Robot Overlords Because I’m Kind of Lazy

About a month ago I bought a robot vacuum. The damn thing is mesmerizing… the way it wanders around my apartment, getting into places I didn’t think it could, finding new ways to completely screw up my cowhide rug. The other day while cleaning it out I realized it had managed to suck up one of my battery pack adapters that must have fallen onto the floor.

It’s pretty amazing. I run it while I’m cleaning other things or when I hop in the shower and it cuts my list of chores down by at least one big task every week.

I’ve begun to apply a similar philosophy to Exversion, and last night you may have received the first of many regular robot-assembled mailings.

We have never made good use of our mailing list, everyone on the team always has other things they would rather do first. And up until recently it was always such a chore to figure out what we should say… new feature alerts? Sure, but how often do those happen? We blog about them too so do we just copy and paste from the blog post or do we link the blog post or do we have to write whole new content just for the mailing? AGH.

A few weeks ago we started something called Thoughts from the Exversion Team, because we often find ourselves wanting to write something about data, data science, or the multitude of data tools that exist without wanting to write a whole blog post about it. The blog hooks into Facebook and Twitter, and it seems bad form to spam everyone with three-paragraph posts all the time: there might be one new blog post a week, but sometimes two or three thoughts posted in a day.

Once that got started, though, with Data Requests, Data Thoughts, and actual freaking data, we had enough content and enough diversity of content to put together an interesting regular mailing.

In addition to pulling the best of our own content, our mailing list also pulls the best data news of the week from HackerNews. So each week you’ll receive the best stories concerning the wide wonderful world of data, curated by a crowd of cynical hackers with impossibly high standards. That alone should make it worthwhile. Who wants to be on HackerNews all the time?
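
For the curious, getting a week of data stories out of HackerNews doesn’t require scraping anything: HN has a public search API (the Algolia one) that does the heavy lifting. A sketch of the kind of query involved, not necessarily exactly what our robot runs:

    import time
    import requests

    # Top HN stories from the past week that mention "data", via the public Algolia HN API
    week_ago = int(time.time()) - 7 * 24 * 3600
    resp = requests.get("https://hn.algolia.com/api/v1/search", params={
        "query": "data",
        "tags": "story",
        "numericFilters": "created_at_i>%d" % week_ago,
        "hitsPerPage": 10,
    })
    for hit in resp.json()["hits"]:
        print(hit["points"], hit["title"], hit.get("url"))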

Of course the robot doesn’t have full control of the mailing list … that would get gross really quick. No, the robot assembles the campaign and then forwards it to me for my approval. So there’s always a human pair of eyes making sure the content is quality.

If you’re not on our mailing list, obviously now is the time to join.

PyGotham, Open Source and Bad Data

A month ago a good friend of mine asked me if I would consider giving a talk at this year’s PyGotham. I enjoy speaking at conferences, so I almost never turn down an opportunity. Python being one of my favorite languages, the only issue was what to talk about.

“Can I do it on debunking other people’s data science?” I asked.

A lot gets written and lectured about the various libraries and modules for loading up data and doing all kinds of analysis, particularly in the Python community. Python is slowly becoming the language of data. True, there are other options specifically tooled for analysis (R, Julia, etc.), but Python combines a lower barrier to entry (easy to learn, already installed), powerful options, and a large active community.

What we don’t talk a lot about is data quality. Most talks on data science start off with “so we take our data and…” with very little comment on getting the data and prepping the sample. And yet it’s these two stages that are the most complicated, that require the most careful thought, and where mistakes are the most damaging. (Just today the NYT published a piece on this very problem.)

There are also very few tools to help identify and prevent these issues.

My original plan was to walk people through these types of mistakes with real world examples. I was a little concerned about how little Python would actually be included in this discussion, but I kept telling myself data science is a huge part of the Python community and this is a huge issue in data science. These are not problems that only trip up students and the intellectually inferior. It was ridiculously easy for me to find examples from major publications like The New York Times and respected blogs like FiveThirtyEight. The consequences of bad data are everywhere.

Then in the middle of the conference a bolt of inspiration hit: we write code that tries to validate and unit test our processes all the time. The main problem with data science libraries is that they assume you understand all the caveats and proofs associated with each model. But why can’t we write a library that analyzes a dataset and gives you feedback on structural issues, potential sampling errors, normalization issues, etc.?

So I went home, bought a six pack of beer, ordered a pizza and changed half of my presentation.(1)

We Are Open Source

We have been working on isolating and open sourcing different components of Exversion’s technology for a while. It’s slow going, largely because with a small team I always have to choose between building something new and reworking the old enough that it can click into place on someone else’s stack. Mind you, we have tweaked our processes to build things that way the first time, but it took us a while to get those habits in place mentally.

Right now our major open source projects are as follows:

Junky: Dataset Profiling

The project I pitched at PyGotham is called Junky, and it got an amazing response, with four or five people volunteering to be contributors right on the spot. While I originally called it a dataset validator for lack of a better term, Eric Schles smartly described it as more of a dataset profiler… which I think better captures what it will be able to do. A profiler doesn’t tell you that your code is good, just how much time it takes to execute, how much memory it uses, etc. From there you may choose to clean things up, or you may not. The profiler’s job is to show you what’s going on that you might not see otherwise.

Likewise, Junky will not tell you if your data is good or bad, but it will measure things like how consistent your categories are, whether your data resembles a normal distribution, how robust your sample population is, where the outliers sit, etc. These are things that anyone with formal training in data science learns to do before cracking out a linear regression model, but with a lot of people coming into the data science frontier from programming, sometimes the basic first steps are missed. Self-taught data scientists sometimes jump directly to executing commands in a stats library without knowing very much about the requirements of the models they’re using. Most of the common statistical analysis methods assume, for example, that your population data has something resembling a normal distribution (i.e., a bell curve), but it is ridiculously easy to collect a sample that isn’t normal in the statistical sense, and that is something you want to know before you do your analysis and draw conclusions.
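
To give a flavor of what those checks look like in code, here is the sort of thing Junky is aiming for, sketched with scipy (illustrative only, not actual Junky code):

    import numpy as np
    from scipy import stats

    # The flavor of checks a dataset profiler might run; illustrative, not actual Junky code.
    def profile_column(values):
        values = np.asarray(values, dtype=float)
        report = {"sample_size": len(values)}

        # normality: D'Agostino-Pearson test; a small p-value means probably not normal
        _, p_value = stats.normaltest(values)
        report["looks_normal"] = p_value > 0.05

        # outliers: anything more than three standard deviations from the mean
        report["outliers"] = int((np.abs(stats.zscore(values)) > 3).sum())

        return report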

When you’re self-taught you learn best through mistakes. Most major mistakes in programming blow up in your face really quickly. But in data science critical errors can go unnoticed for long periods of time, crippling the passionate beginner’s ability to learn and improve.

Junky is intended as a way to help people who want to do data right explore those problems on their own.

Data Cleaning Boilerplate

Another general data tool I’ve been working on is the Data Cleaning Boilerplate. The concept is simple: I write a lot of data cleaning scripts. Usually I end up writing new scripts from hacked-together, copy-and-pasted bits of old scripts. At some point I decided it would be really useful to start writing more generic functions for the things I end up doing over and over again, so that I can copy and paste from the boilerplate and get things done faster.
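
The functions in it are the small, boring utilities everybody ends up rewriting; something in this spirit (a made-up example, not copied from the actual repo):

    import csv
    import re

    # The kind of generic helper the boilerplate collects; illustrative, not from the actual repo.
    def normalize_header(name):
        """' First Name ' -> 'first_name'"""
        return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")

    def load_csv(path):
        with open(path) as f:
            reader = csv.reader(f)
            headers = [normalize_header(h) for h in next(reader)]
            return [dict(zip(headers, row)) for row in reader]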

Exversion Layer

Layer is a standalone version of Exversion’s version control system. It hooks into Postgres and uses a RESTful API to receive and return changes in data state.

Exversion Server

Exversion Server will be an open sourced version of Exversion’s data store technology. While there isn’t much to look at now, I’ve put this project back on the main dev schedule for the coming months, largely after conversations with Rufus Pollock of OKFN and the HDX team at the UN. CKAN, which HDX uses, has a data store module that appears to be set up exactly the same way we set up the prototype of Exversion that we hacked together on the bus down to SXSW. We ditched that model for very good reasons when we came back up to NY, so I think it will be worthwhile to release something that can be hooked into CKAN as an alternative. I’ll blog a more detailed explanation of my thinking in this regard when that is ready to go.

All of these projects welcome potential contributors, so if you’re interested please file an issue letting us know what you’d like to improve about them.

———————————-

(1) I’ll put the final presentation online as soon as PyGotham releases the video. For now here are the slides.

Does Failing At Startups Make You More Successful Than Succeeding At Them?

Springtime For Hitler, The Startup

The other day someone asked me how things were going with Exversion, a question that startup founders get asked every time we pull our noses out of our Cup o’ Noodles and dare to go outside. Let’s be honest: the person asking is expecting reports about progress and achievements, but given the odds in startup land, to the person answering that question it’s more about benchmarking where you are on the road to failure.

In my case it was impossible to answer honestly. Things were great: I asked my last remaining cofounder to leave four months ago, I’d terminated a free hosting arrangement and put Exversion’s small collection of expenses entirely on my credit card, I’d stopped meeting with investors and had taken a job that made it impossible to work on this company full time.

Everything about the current situation screamed failure, and yet since starting the downward spiral I have signed a lease on my dream apartment in my dream neighborhood, paid off all of my outstanding credit card debt, developed a steady plan for paying off my student loans, stabilized and strengthened Exversion’s infrastructure, hired three people, begun a collaboration with Microsoft to integrate our API into Excel, sipped champagne on a private yacht, smoked Cuban cigars on the beach at Montauk with Ja Rule, committed to speak at three major conferences and received invitations to speak at two others. Oh, and that new job? It’s at the UN.

About a year ago Exversion was taking off: YC flew us out for a chat, TechCrunch covered us, powerful people in DC invited us to VIP lunches, investors sought us out, but I was tired, in debt, isolated from my friends and family and perpetually broke. It would not be fair to say I was miserable, because I loved what I was building, I loved working with my cofounders, and I loved the perks offered to us when everyone assumed our success was inevitable. But my life was in a perpetual state of chaos, as was the life of my team. Handling my stress was one thing– living in the 3rd world without running water or electricity and then being homeless in the former Soviet Union does wonders for your crisis management skills, let me tell you– but watching my cofounders (people I had bonded with and adored) deteriorate under this same stress was too difficult. That made me miserable.

If anyone else were writing this post, the remaining paragraphs would roll out like this: a little symbolic flagellation, a few trite “lessons learned”, capped off with reaffirming one’s vows to the cult of personality that poisons the tech scene. “Despite all that I have suffered I still firmly believe in the effectiveness of this snake oil rich white men have sold me, please don’t shun me for my failure. I still deserve to be among the best and brightest.”

But fuck that. (1)

Don’t Bitch About It, Get the Data

Right now I feel like I’m starring in a modern day adaptation of The Producers, because having a FAILING startup has been much better for my career than a successful one ever was. I’m not joking: this is actually my second time failing, and the second time I’ve seen a huge jump in my status and income as a result. And when I look around my community I see the same strange little pattern: the people “succeeding” with their startups look terrible while the people “failing” are flying around the world, eating at good restaurants, making important new friends. I even know a few serial failures, folks who keep founding startups that never come close to raising money or getting significant traction and yet are nowhere near financially ruined.

And I’ve come to feel that the difference between founders who profit from failure and founders who are crushed by it is how much they buy into the rhetoric of startup Gods (the investors, bloggers, and authors who write long, generic, feel good advice about how startup founders should think/feel/behave ultimately based on assumptions of meritocracy and consumer intelligence that have been so thoroughly debunked by now it’s a wonder anyone can base an argument on them and still be taken seriously). Simply put, in my experience, those who think the path to success is built on doing exactly what startup experts tell you to do tend to get ripped apart. While those who are open to following or dismissing the startup dogma depending on their needs sometimes do better as failures than others do as successes.

But I’m a data person, so observing this in retrospect was not enough. I wanted to figure out a way to demonstrate it with science.

Step One: Let’s piss off Naval Ravikant

A lot of people fail at data science because they forget about the SCIENCE part of it. They treat data like tea leaves, mindlessly throwing a bunch of it into a cup, then staring at the bottom and trying to interpret patterns. The first step in exploring an issue with data is to establish a research question. So what did I want to know?

  • I want to know what positions startup founders are most likely to have before they become founders
  • I want to know what positions startup founders are most likely to have after they are no longer founders
  • I want to know if things like how long their startup lasted, their gender, their age, whether they are technical or not have any influence on the first two questions.

So how do I find that data? The first thing I need is employment histories, which can easily be found on LinkedIn … except… well, there are two problems. First, LinkedIn has records on millions of people, most of whom will never ever found a company of any kind. The second problem is that LinkedIn really hates it when you scrape them.

They really really hate it.

Rather than try to write a perfect script to thwart LinkedIn’s anti-scraping methods and filter millions of records down to the population of people I wanted to look at, I decided to start with another site first.

AngelList.

Of course AngelList also doesn’t like being scraped, and their markup makes it very difficult to get information, if indeed the information is even there to begin with. But people do tend to link their LinkedIn profiles to their AngelList profiles, and by forgoing the need to crawl LinkedIn itself we eliminate most of the problems of harvesting data from it.

So here’s what I did: After playing around on AngelList I realized the PEOPLE section had an ajax request that hit the following URL: https://angel.co/people/load_more?page=1&per_page=25&skip_loading=true which would return a nice neat chunk of HTML easily parsed by BeautifulSoup. From there I could extract all the links to AngelList profiles and parse the profiles themselves with BeautifulSoup. Essentially all I’m looking for is a link with the class name “fontello-linkedin”. If that exists, the script grabs the location of the LinkedIn profile, downloads the page from LinkedIn’s servers, saves it to an HTML file on my hard drive and moves on.

The reason for saving profiles rather than parsing them was so I could tweak what information I was extracting as often as I wanted without having to worry about LinkedIn finding out and blocking me.
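
A minimal sketch of that loop, for the curious. The load_more URL and the “fontello-linkedin” class are exactly as described above; the rest (like the profile-link filter) is approximate:

    import os
    import requests
    from bs4 import BeautifulSoup

    LOAD_MORE = "https://angel.co/people/load_more?page=%d&per_page=25&skip_loading=true"

    def save_linkedin_profile(angellist_url, out_dir="profiles"):
        profile = BeautifulSoup(requests.get(angellist_url).text, "html.parser")
        linkedin = profile.find("a", class_="fontello-linkedin")
        if linkedin is None:
            return  # no LinkedIn profile linked, move on
        html = requests.get(linkedin["href"]).text
        filename = angellist_url.rstrip("/").split("/")[-1] + ".html"
        with open(os.path.join(out_dir, filename), "w") as f:
            f.write(html)

    def harvest(page):
        chunk = BeautifulSoup(requests.get(LOAD_MORE % page).text, "html.parser")
        for link in chunk.find_all("a", href=True):
            if link["href"].startswith("https://angel.co/"):  # rough filter for profile links
                save_linkedin_profile(link["href"])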

Everything seemed perfect and simple. Until I crashed AngelList’s servers.

Oops?

For the record … I’m almost 100% sure it was a coincidence that AngelList’s downtime lined up with the running of my deliberately misbehaving bot. I mean… it’s not that complicated a process, and it wasn’t hitting that many pages. But whatever, sorry.

Once AngelList was back online, I tweaked the script to sleep for a random number of seconds (up to a full minute) both to mitigate the small chance that I was overloading AngelList’s servers and to keep LinkedIn from detecting my bot. It made the script very slow, but I just opened up work for my actual job in another window and let it run.
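
The tweak itself is just a couple of lines inside the harvesting loop sketched above:

    import random
    import time

    # wait a random interval, up to a full minute, between requests
    time.sleep(random.uniform(1, 60))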

Step Two: Cleaning and Normalizing the Data

I’ve gone into more detail about this in other posts, but once I had the data from LinkedIn I couldn’t really do anything with it until I had normalized it. Particularly the job titles. Let me tell you, people put some ridiculous things down as their official positions. I mean people were identifying themselves as “VP Crystal Ball” and “Wish Granter”. One guy was apparently a janitor before starting his startup. Plus there was every conceivable spelling, abbreviation and capitalization variation of “Cofounder and CEO” imaginable.

And the reality of data is that it’s inherently biased, and gets more biased the more you clean it. So where are the biases with this data? We have a potential sampling error with AngelList, as investors are much more likely to have public profiles on AngelList than small time entrepreneurs. So right away we might have a skew just in the very nature of who is likely to submit their info.

In the course of cleaning I did a lot of guessing. I can use machine learning tools to determine gender with a reasonable degree of certainty, but that’s not fool-proof. I estimated the date of birth by assuming that the earliest date of education was undergrad and that undergrad education starts at about 19. Reasonable, but again not fool-proof.
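
The birth-year guess, for instance, boils down to something like this (the field names are hypothetical, the logic is as described):

    def estimate_birth_year(education_entries):
        """education_entries: list of dicts with a 'start_year' key pulled from LinkedIn."""
        start_years = [e["start_year"] for e in education_entries if e.get("start_year")]
        if not start_years:
            return None
        # assume the earliest education listed is undergrad, started at about age 19
        return min(start_years) - 19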

Step 3: Choosing a Method of Analysis.

Once I had the data I had to figure out a way of tracing paths in and out of founder status and figure out which ones were the most frequently travelled. And how should I arrange the data so that it could be done easily?

Normally my first instinct would be to graph it. As a first step graphs are pretty nice, they give a clear picture you can show basically anyone to illustrate a relationship. Afterwards you want to dot all your i’s and cross all your t’s with statistical significance, but graphs are a nice way to get your bearings and see if you’re on the right track.

Except when I built my job title classification system I ended up with about 35 titles, which is way too many to graph nicely. If I narrow it down further, I lose nuance and possibly inject more vulnerabilities into the process.

So what to do?

Markov chains!

Markov chains would allow me to look at what the most common transitions between jobs were, and how my demographic criteria affected (or didn’t affect) the results. In this case I don’t really care so much about the probability of each chain; if I did calculate those values it would be possible to build a hack that analyzed a person’s LinkedIn profile and determined their likely career path in startup land. Which could be neat, but is essentially a project for another day. So really what I did is a simplified version of Markov chains, reporting only the frequency of each full chain itself.
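
In code that simplified version is little more than counting three-job windows. A sketch, assuming each profile has already been reduced to an ordered list of normalized job titles:

    from collections import Counter

    def chain_frequencies(profiles, length=3):
        """profiles: list of job-title lists, each ordered from earliest to most recent."""
        counts = Counter()
        for titles in profiles:
            for i in range(len(titles) - length + 1):
                counts[tuple(titles[i:i + length])] += 1
        return counts

    # chain_frequencies(profiles).most_common(10) produces the kind of list shown below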

The most common chains in our sample look like this:

Board Member->Board Member->Board Member
Investor->Investor->Investor
Investor->Board Member->Board Member
Founder->Founder->Founder
Board Member->Board Member->Investor
Board Member->Investor->Board Member
Director->Director->Director
Founder->Board Member->Board Member
Investor->Investor->Board Member
Board Member->Investor->Investor
Board Member->Board Member->Founder

Just on first glance the data suggests that startup land is one giant game of musical chairs, with people already on the top switching off between investing and founding. But remember we already established that our dataset was vulnerable to a sampling error, potentially skewing too heavily on the investor side. These results aren’t necessarily surprising.

Big Data Is Still Bullshit, But Sometimes Size Matters

Everyone thinks the benefit of having lots of data is accuracy, but that’s not true. More data collected in a biased manner is still biased. The main advantage to having more data is the ability to control for these kinds of errors. We have data from about 750 people, split into roughly 4,000 items. For most forms of statistical analysis you really only need about 100 cases, so we have some leeway here.

There are a couple of things we could do to control for a sampling bias. We could take a random sample of our 700+ individuals. We could also engineer a completely unbiased sample by randomly selecting a set number of investors, engineers, designers, product managers, etc.

So let’s see what– if any– skew might exist in our data. I wrote a quick script to count the number of profiles with at least one job title equivalent to either “Investor” or “Board Member”.

384 investors out of 778

Not bad. Much better than I expected.
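
For reference, that check is only a couple of lines on top of the same title lists used for the chains (a sketch, not the exact script):

    # profiles with at least one Investor or Board Member title
    investorish = {"Investor", "Board Member"}
    count = sum(1 for titles in profiles if investorish & set(titles))
    print("%d investors out of %d" % (count, len(profiles)))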

Step 4: Selecting At Random

Still, I really like drawing random samples from my data and running the same analysis a few times. Besides, doing it really only requires adding two lines of code to our Markov chain generator. No reason not to.
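
Something like this, on top of the chain counter sketched earlier (the sample size here is arbitrary):

    import random

    sample = random.sample(profiles, 200)              # draw a random subsample of profiles
    top_chains = chain_frequencies(sample).most_common(10)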

And our most common chains with a random sampling?

Sample One:

Board Member->Board Member->Board Member
Investor->Investor->Investor
Board Member->Board Member->Investor
Investor->Board Member->Board Member
Board Member->Investor->Board Member

Sample Two:

Board Member->Board Member->Board Member
Investor->Investor->Investor
Founder->Founder->Founder
Founder->Board Member->Board Member
Investor->Board Member->Board Member

Sample Three:

Board Member->Board Member->Board Member
Investor->Board Member->Board Member
Board Member->Board Member->Investor
Founder->Founder->Founder
Investor->Investor->Investor

So let’s have some fun and see what the situation looks like for the rest of us by completely removing all the investors.

Founder->Founder->Founder
Advisor->Advisor->Advisor
Engineer->Founder->Senior Engineer
Founder->Senior Marketing->Founder
CEO->Lead Marketing->Founder

But we started collecting this data in order to look at the most common patterns in and out of Founder status. So let’s tweak our scripts again, bring the investors back in, and restrict our chains to only the ones that follow this pattern: ?->Founder->?

Founder->Founder->Founder
Investor->Founder->Founder
Board Member->Founder->Board Member
Board Member->Founder->Founder
Founder->Founder->Investor
Investor->Founder->Investor
Founder->Founder->Engineer
Founder->Founder->Board Member
Investor->Founder->Board Member
Founder->Founder->Manager
Investor->Founder->Advisor
Founder->Founder->Lead Engineer
Investor->Founder->Lead Engineer
Founder->Founder->Senior Marketing
Investor->Founder->Associate

As I continued to pick through the data I saw a lot more downward motion– that is, people starting off in one role, founding a company, and then landing in a position lower than the one they had before. But with 35 categories there are so many different combinations that it becomes hard to really get an overview of that idea.

What I decided to do was build another dictionary that would take those 35 categories and assign each a numerical value between 1 and 6, with 1 being an entry-level (or non-startup) job and 6 being Board Member. Here’s a general idea of what that looked like:

6 – Board Member
5 – Investor
4 – Advisor
3 – Lead Engineer
2 – Senior Engineer
1 – Engineer
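
In code the ranking is just a lookup table, and “upward” or “downward” falls out of comparing the rank of the job before founding to the rank of the job after. A sketch, assuming the full table covers all 35 titles:

    RANK = {
        "Board Member": 6, "Investor": 5, "Advisor": 4,
        "Lead Engineer": 3, "Senior Engineer": 2, "Engineer": 1,
        # ...and so on for the rest of the 35 titles
    }

    def movement(chain):
        """chain: a (before, 'Founder', after) tuple of normalized titles."""
        before, after = RANK.get(chain[0], 1), RANK.get(chain[-1], 1)
        if after > before:
            return "Upward"
        if after < before:
            return "Downward"
        return "No Movement"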

Now you may find yourself thinking “But that’s completely arbitrary and not at all objective…” and you’d be right. Welcome to data science 🙂 At some point in every analysis there is some kind of judgement call based on a completely subjective assumption. Usually these decisions happen at the collection stage, so they are easy to cover up. If I present data on the habits of doctors, not many people are going to think to ask how I’ve defined “doctor” (medical doctors only? PhDs? non-MD medical professionals?), yet some decision had to be made in order to collect the data in the first place.

The main problem with the assumption I’m making using this ranking system is that it does things like put an entry level Engineer roughly equal with an Intern. HackerNews will love this idea I am sure.

Anyway, keeping our filter on the ?->Founder->? pattern, this is what things look like with the average time spent at each position and the average time spent as a founder:

chain          freq   first_avg   second_avg   third_avg   founder_years   male   female   technical    non-technical
No Movement    199    3.451843    3.988275     3.577052    3.981381        178    9        93 (47%)     106 (53%)
Downward       229    3.650291    4.185953     3.063683    4.239042        200    10       116 (51%)    113 (49%)
Upward         116    3.186782    4.429598     3.635057    4.320076        104    5        50 (43%)     66 (57%)

Couple of things that surprised me about this:

– Founders who see a career boost post-founding actually stay LONGER than other groups. Remember we’re restricting the data to the pattern ?->Founder->? here, so it’s not just about founder_years being higher, but the difference in second_avg too. This is the complete opposite of what I was expecting, but it seems to support PG’s “stay alive and get rich” axiom.

– I added the percentages to the technical/non-technical columns because I could not believe what I was looking at … WHOA seriously? I would think technical skills would give you a better chance of coming out of the founding experience with a better job. After all aren’t we all saying how badly everyone needs good developers? Ouch.

– ….. there really are no women in this industry. Well, okay that wasn’t actually a surprise… but… man.

Places Where Things Might Be Wrong

A smart reader might be wondering: if the data isn’t skewed significantly towards Investors, why do they dominate so many of our chains? Is there something worth reading into in that? Maybe startup land is less meritocracy and more “rich getting richer helping out their friends”?

And while that might actually be true and I’d certainly love to write that blog post, I don’t think this data can support that conclusion. One of the things I noticed while writing the script that parsed LinkedIn files was that investors tended to list all the prominent companies they had invested in under their experience. So in other words the job title “Investor” might be at Company “Bullshit Capital” but it might also be seen listed under “eBay” followed immediately by listings for “Investor – Facebook”, “Investor – Twitter”, etc.

That would naturally make the chain Investor->Investor->Investor (or Board Member->Board Member->Board Member) way more likely than other possibilities.

The other factor is that I lumped ALL types of investing together (angel, VC, seed, etc.) whereas other roles were split into hierarchies (Engineers, Designers, Marketing People, etc.). The rationale for this was twofold: one, it’s not really fair to equate angel investing with entry-level investing; two, I was interested in the paths technical and non-technical people take into founding a company … less interested in the breakdown of investors in the pool.

Fail Slow and Be Sneaky

My takeaway from this data is that coming out of starting a company with a better career than the one you left is about staying alive as long as possible. I’m very curious as to why founders with technical skills appear to be less likely to rebound strong post-startup life. It could be a “Big Fish, Little Pond” illusion– after all, a lowly “Engineer” at Google might be a lead architect anywhere else– or it could be something else.

What remains to be seen is what keeps a startup alive for a long time. After all, it costs nothing at all to add the title “Founder” to a listing on LinkedIn. Creating an AngelList account is free. We tend to assume that startup success is fast, overnight if possible. Lots of users, lots of capital raised, the earlier the better. It’s almost like working for it is a sign of something lacking. If you were smarter you’d have found the lightning in the bottle right away.

But the data suggests that while that might be good for investors, it’s not so good for the founders.

And that’s really what I find so striking about my experiences with “startup success” -vs- “startup failure”. Everything that investors and incubator gatekeepers were telling us we HAD TO DO in order for them to take us seriously left me and my team worse off. I have yet to see anyone present any evidence that quitting your job to live just below the poverty line without medical insurance is more likely to lead to startup success than bootstrapping or side hustling.

It’s worth asking: who benefits from furthering the crash-and-burn methodology? Well … the investors. If you’re bootstrapping you don’t need to raise money until you start to see growth; that gives you leverage and possibly your pick of investors. Investing in your company becomes more expensive. The payoff gets smaller. You might forego incubation altogether.

Investors end up with fewer options to choose from, so their odds go down. Maybe your odds go up– or not– but their odds definitely go down. That’s not nefarious, that’s just basic common sense. The cheaper you can invest, the more investments you can make and the better you spread out your risk. And the cheaper you invest, the more money you make when one of those investments hits.

It all makes perfect sense.

It’s just not a very good system if you happen to be a founder.

—————————————-

Data (2)

Raw job listings

Raw job listings with investors filtered out

Full chains

—————————————-

(1) – You may have noticed that my cursing has increased. That’s because I always used to give my blog posts to Jacek before publishing and he thought my fondness for coarse language was unprofessional … he wasn’t wrong about that. (*waves* Hi Jacek!)

(2) – I will be putting most of the data up on Exversion tomorrow and adding links here as I go. It doesn’t take that long to get it on Exversion, but in the middle of cleaning up the final files I ran out of API credits for genderize.io, so it’s either wait until tomorrow or release the data without the male/female split. LOL