About a month ago, an alert came across my desk (well… metaphorical desk anyway): the Centers for Medicare & Medicaid Services had released updated data downloads for their Open Payments program. When I followed the link through to check it out, the following warning greeted me:
Some datasets, particularly the general payments dataset included in the zip file containing identifying information, are extremely large and may be burdensome to download and/or cause computer performance issues. […] Be advised that the file size, once downloaded, may still be prohibitive if you are not using a robust data viewing application. Microsoft Excel has limitations on the number of records it can display, which this file exceeds.
Indeed, some of the CMS files are as much as a GB of data. And here I thought, “Hey, I have a company for this” (so yeah, if you want to poke through the CMS Open Payments data, all of it is on Exversion right here).
APIs are nice 🙂
That being said, one wonders exactly what you can do with Open Payments data. It’s natural to look at the words Medicare/Medicaid and assume these are all medical bills, but actually it’s a lot more interesting than that: 
This data lists consulting fees, research grants, travel reimbursements, and other gifts the health care industry – such as medical device manufacturers and pharmaceutical companies – provided to physicians and teaching hospitals.
Well now that sounds pretty nefarious. I mean come on, we all know that the money moved around through gifts and grants influences the type of treatments doctors recommend. So now the government is giving you an opportunity to look directly at that activity.
The fact that they took something really interesting and wrapped it up in the most uninteresting way possible is to be expected. It’s a government thing.
Identified -vs- Deidentified Datasets
If you check out our collection of CMS data, the first thing you’ll notice is that each data type is split into two different sets: identified and deidentified datasets. This is not, as I first assumed, the same data with identifying information removed (I admit that this wouldn’t actually make any sense to begin with, but in my defense I’ve seen the government do MUCH worse with their open data). Instead, the deidentified set is a collection of cases where some of the necessary data about who received what is missing or ambiguous.
Otherwise what the CMS released fits three categories:
- General Payments: Payments or other transfers of value not made in connection with a research agreement or research protocol.
- Research Payments: Payments or other transfers of value made in connection with a research agreement or research protocol.
- Physician Ownership Information: Information about physicians who have an ownership or investment interest in an applicable manufacturer or GPO.
Looking At the Data: Who Gets the Most Research Dollars?
Essentially what CMS has released is just a dump of their database. Each file has what feels like twenty or more columns, most of which have no information in them. The benefit of accessing this data through an API, as opposed to downloading the file and trying to work with that, is that we can segment the amount of data we’re looking at before committing any computer memory to the task.
The first thing we did was rearrange Research Payments to look at how much money each state received for the year 2013. Because this is a smaller dataset, we wrote a Python script to iterate through each page of data returned by the API, sorting and rearranging as needed. This is not recommended for super large datasets, as you will hit our API’s rate limit pretty quickly, but for this size it wasn’t an issue. We used Python to write nice clean JSON we could copy and paste into d3.js to create an interactive map (click through to see).
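For the curious, the paging-and-summing part of a script like that can be sketched roughly as follows. This is not our actual script: `fetch_page` is a stand-in for whatever function really calls the API, the field names `recipient_state` and `amount` are assumptions (check them against the real column names), and the records in the demo are made up.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[dict]]) -> Iterator[dict]:
    """Yield records page by page until a page comes back empty."""
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break
        yield from records
        page += 1


def total_by_state(
    records: Iterable[dict],
    state_field: str = "recipient_state",  # hypothetical column name
    amount_field: str = "amount",          # hypothetical column name
) -> Dict[str, float]:
    """Sum payment amounts per state, skipping rows with no state."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        state = rec.get(state_field)
        if not state:
            continue
        totals[state] += float(rec.get(amount_field) or 0)
    return dict(totals)


# Made-up pages standing in for real API responses.
fake_pages = {
    1: [{"recipient_state": "NY", "amount": "100.00"},
        {"recipient_state": "CA", "amount": "75.00"}],
    2: [{"recipient_state": "NY", "amount": "250.50"},
        {"recipient_state": "", "amount": "10.00"}],
}


def fake_fetch(page: int) -> List[dict]:
    return fake_pages.get(page, [])


state_totals = total_by_state(iter_pages(fake_fetch))
# JSON like this can be pasted into a d3.js map.
print(json.dumps(state_totals, sort_keys=True))
```

Swap `fake_fetch` for a function that really requests page N and the rest stays the same.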
Along the way we discovered something funny. All the payment data was between the months of August and December. After a little research we discovered that this is a relatively new thing for CMS. The mandate to release this information was part of the Affordable Care Act. As 2013 is the first year, they could not collect a full year’s worth of data.
That means 2014’s files will be EVEN LARGER.
Will Doctor For Food
Anyway, we wanted to poke around this General Payments file, since it seemed like the most interesting stuff would be there. But the identified version is over a GB… kind of unpalatable.
Luckily with Exversion we can take a sample and play around with that instead. How about 50,000 records? Fetching 50,000 records and analyzing them took seconds. All we had to do was add the ‘_limit’ parameter to our request.
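A request like that might be put together something like this. Treat it as a sketch: only the `_limit` parameter comes from the text above, while the base URL, the `key` parameter name, the dataset ID, and the API key are placeholders, not the real endpoint.

```python
from urllib.parse import urlencode

# Placeholder base URL, NOT the real API endpoint.
BASE_URL = "https://example.com/api/v1/dataset"


def build_sample_url(dataset_id: str, api_key: str, limit: int = 50000) -> str:
    """Build a request URL that caps the number of records returned."""
    # "key" is an assumed parameter name; "_limit" is the one from the post.
    query = urlencode({"key": api_key, "_limit": limit})
    return f"{BASE_URL}/{dataset_id}?{query}"


url = build_sample_url("DATASET_ID", "YOUR_API_KEY", limit=50000)
print(url)
```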
I bet if I told you Big Pharma was paying physicians and teaching hospitals in FOOD you wouldn’t believe me, but here’s the breakdown of that 50,000 record sample:
| Type of Payment | Number of Payments | Total Amount |
|---|---|---|
| Food and Beverage | 40302 | $1,037,941.01 |
| Travel and Lodging | 2710 | $862,276.81 |
| Compensation for serving as faculty | 90 | $194,884.57 |
| Royalty or License | 100 | $6,913,971.38 |
| Current or prospective ownership or investment interest | 4 | $529,830.08 |
| Compensation other services | 2787 | $3,256,475.94 |
The interesting thing here is that in our sample the vast majority of payments are tiny amounts related to wining and dining doctors and hospitals, but that does not add up to the most money spent. No, much more money is spent on royalties and grants, but it goes to only a handful of institutions.
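A breakdown like the one above boils down to counting and summing per payment type. Here is a minimal sketch of that step, assuming each record is a dict with `payment_type` and `amount` fields (both names are guesses, not the actual column names) and using a few made-up rows rather than the real sample:

```python
from collections import defaultdict
from typing import Dict, Iterable


def breakdown_by_type(
    records: Iterable[dict],
    type_field: str = "payment_type",  # hypothetical column name
    amount_field: str = "amount",      # hypothetical column name
) -> Dict[str, dict]:
    """Count payments and sum their amounts for each payment type."""
    summary: Dict[str, dict] = defaultdict(lambda: {"count": 0, "total": 0.0})
    for rec in records:
        kind = rec.get(type_field) or "Unknown"
        summary[kind]["count"] += 1
        summary[kind]["total"] += float(rec.get(amount_field) or 0)
    return dict(summary)


# Made-up rows for illustration only.
sample = [
    {"payment_type": "Food and Beverage", "amount": "12.50"},
    {"payment_type": "Food and Beverage", "amount": "20.00"},
    {"payment_type": "Royalty or License", "amount": "50000.00"},
]

breakdown = breakdown_by_type(sample)
# Largest totals first, to show where the money actually goes.
for kind, stats in sorted(breakdown.items(),
                          key=lambda kv: kv[1]["total"], reverse=True):
    print(f"{kind}: {stats['count']} payments, ${stats['total']:,.2f}")
```

Even in a toy example the same pattern shows up: lots of small food payments, one big royalty.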
So there you go. Now you can play around with CMS’s Open Payments data without worrying about choking your computer. Can’t wait to see what the rest of the internet does with this.