Dealing with Complex Data on Exversion

Handling multidimensional or nested data structures was a big part of enabling GeoJSON support for Exversion. The traditional file upload only accepts CSV files, which is fine when your data is flat: simple columns and rows, like a basic spreadsheet.

But data within data can’t be processed the same way, and each level of complexity opens up new challenges with file formats, standardization and normalization. It seemed like it would be months before we could offer support for that type of schema.

And then we built the API and it suddenly became super easy.

Step One: Authenticating Without Writing Any Code At All

When it comes to publishing data, a lot of the people we work with are not very technical. They know how to use Python, R, or MATLAB, but they’re not prepared to develop a whole OAuth client. And yet, the security OAuth offers is essential. We needed to figure out a way to serve these people, while still protecting their data.

The solution was building an OAuth client directly into Exversion specifically for these use cases. If you only want access to your own account, and only your own account, click a few buttons and Exversion will issue a valid access token to sign your requests with.

Obviously, if you need to access multiple accounts, the best solution is to build your own client. Sharing access tokens among multiple parties or multiple applications negates their security benefits, but if you only want to bypass the restrictions of the web upload system and get your data up on your account, this is a perfectly acceptable workaround.

Step Two: Create a Dataset

Now that you have a valid access token, you have two options for creating a dataset. The first is to simply do it through the website; the second is to send a POST request with the same information to the API.

An example in Python (using the Requests module which you will need to install if you don’t have it already):

import json      # needed for json.dumps below
import requests

url = 'https://www.exversion.com/api/v1/dataset/create'

# Dataset metadata; put your own access token in place of the truncated one
payload = {"access_token":"YmQ1...", "name":"api create", "description":"this was created with the api","source_url":"http://www.exversion.com", "org":0,"source_author":"Me","source_date":"August 7, 2013", "source_contact":"info@exversion.com", "private":0}

# Send the metadata as a JSON-encoded request body
r = requests.post(url, data=json.dumps(payload))

print(r.json())

The API will return a JSON response containing a string of numbers and letters that serves as the id for the dataset.

{"status":200,"message":"Success","body":[{"dataset":"EV5XF25RH3MIMPP","sourceURL":"http:\/\/www.exversion.com","sourceAuthor":"Me","sourceDate":"August 7, 2013","sourceContact":"info@exversion.com","uploadedBy":{"id":"1","name":"bellmar"},"description":"this was created with the api","heritage":[],"size":0,"forkchanges":null,"columns":[""]}]}
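
To pick the id out of that response in code rather than by eye, something like the following should work. This is a minimal sketch that assumes the response always has the shape shown above (a body list whose first element carries a dataset key); the helper name dataset_id_from_response is ours and not part of the API.

```python
import json

def dataset_id_from_response(response):
    """Return the dataset id from a parsed create-dataset response.

    Assumes the shape shown above: {"status": ..., "body": [{"dataset": ...}]}.
    """
    if response.get("status") != 200:
        raise ValueError("request failed: %s" % response.get("message"))
    return response["body"][0]["dataset"]

# The sample response from above, trimmed to the fields used here
sample = json.loads(
    '{"status":200,"message":"Success","body":[{"dataset":"EV5XF25RH3MIMPP"}]}'
)
dataset_id = dataset_id_from_response(sample)
print(dataset_id)
```

With a live request you would pass r.json() instead of sample.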

You can also find the dataset id by pulling up the dataset on Exversion. It’s in the URL:

https://exversion.com/data/view/BVVYOK9DUITSH36/united-states-population-by-state-and-age

Copy it down because you’ll need that ID in the next step.
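
If you would rather not copy the id by hand, it can be pulled out of the URL. The sketch below assumes every dataset URL follows the /data/view/ID/slug layout of the example above, which is the only layout we show here; dataset_id_from_url is a hypothetical helper name.

```python
from urllib.parse import urlparse

def dataset_id_from_url(url):
    """Return the dataset id from an Exversion dataset page URL.

    Assumes the path looks like /data/view/<ID>/<slug>, as in the example above.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[:2] != ["data", "view"]:
        raise ValueError("not a dataset view URL: %s" % url)
    return parts[2]

example = ("https://exversion.com/data/view/BVVYOK9DUITSH36/"
           "united-states-population-by-state-and-age")
print(dataset_id_from_url(example))
```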

Step Three: Push Your Data to Exversion

Now that we have a place to put the data, uploading it into the system is just another POST request.

That looks like this:

import json      # needed for json.dumps below
import requests

# Rows to upload: one dict per row, with the same keys in every row
mydata = [{"name":"Susan","gender":"f", "age":30},
          {"name":"Steve","gender":"m", "age":35},
          {"name":"Frank","gender":"m", "age":28}]

url = 'https://www.exversion.com/api/v1/dataset/push'

# Put your own access token and dataset id here
payload = {"access_token":"YmQ1...", "dataset":"ZV0C2R4E2MF8LBX", "data":mydata}

r = requests.post(url, data=json.dumps(payload))

If you can load your data into Python, you can push it to Exversion. For best results you may want to break the data up into multiple requests. The larger your request, the more likely it is to time out. How many items should be in each request really depends on the size of each individual item. The more columns your data has, the fewer rows should be in each request.

Here’s a simple script that uses the ijson module to stream a large file and push it to Exversion in chunks of twenty rows at a time:
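
The sketch below is one way such a script might look. It assumes the file is a single top-level JSON array of row objects (hence the ijson prefix 'item'), reuses the push endpoint and payload shape from the example above, and uses function names (chunked, push_file) that are ours. Treat it as a starting point rather than a finished uploader.

```python
import json
from itertools import islice

PUSH_URL = 'https://www.exversion.com/api/v1/dataset/push'

def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def push_file(path, dataset_id, access_token, chunk_size=20):
    """Stream the JSON file at `path` and push it chunk_size rows at a time."""
    # Third-party modules, imported here so chunked() works without them
    import ijson
    import requests

    with open(path, 'rb') as f:
        # 'item' selects each element of a top-level JSON array
        # (an assumption about how the file is laid out)
        rows = ijson.items(f, 'item')
        for chunk in chunked(rows, chunk_size):
            payload = {"access_token": access_token,
                       "dataset": dataset_id,
                       "data": chunk}
            # ijson parses non-integer numbers as Decimal by default;
            # default=float lets json.dumps serialize them
            r = requests.post(PUSH_URL,
                              data=json.dumps(payload, default=float))
            r.raise_for_status()  # stop at the first failed request
```

Because ijson yields one row at a time, only the current chunk is held in memory, so the file itself can be much larger than what would fit in a single request.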

Another thing to be aware of is that Exversion defines the schema of your dataset from the first row of data submitted to it. If the first row is missing some columns, Exversion will reject any data submitted afterwards with values in those columns, so make sure the first row of your dataset is complete before pushing.
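
A quick check before pushing can catch this. The sketch below only compares top-level keys (it does not look inside nested objects), and both helper names are hypothetical. It reports which columns the first row lacks, and can move a complete row to the front if one exists.

```python
def missing_from_first_row(rows):
    """Return the columns that appear somewhere in rows but not in rows[0]."""
    all_columns = set()
    for row in rows:
        all_columns.update(row)
    return sorted(all_columns - set(rows[0]))

def complete_row_first(rows):
    """Return a copy of rows with a row containing every column moved first."""
    all_columns = set()
    for row in rows:
        all_columns.update(row)
    for i, row in enumerate(rows):
        if set(row) == all_columns:
            return [row] + rows[:i] + rows[i + 1:]
    raise ValueError("no single row contains every column")

rows = [{"name": "Susan", "age": 30},
        {"name": "Steve", "gender": "m", "age": 35}]
print(missing_from_first_row(rows))   # ['gender']
reordered = complete_row_first(rows)
```

If no row is complete, you would need to fill in the first row yourself; how Exversion treats placeholder values is not something we cover here.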

Step Four: Querying Multidimensionally

Through the API, nested JSON objects are easily converted into queryable data. All we need to do to have Exversion filter data by columns inside columns is specify the levels. Instead of:

{"key":"[YOUR KEY]","merge":0, "query":[{"dataset":"MXZHQGZSVH8484K", "params":{"state":"CA"}, "_limit":3}]}

We would run…

{"key":"[YOUR KEY]","merge":0, "query":[{"dataset":"MXZHQGZSVH8484K", "params":{"location.city":"Oakland"}, "_limit":3}]}

This gets at the city subcolumn within the location column.
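
If your filters already live in a nested Python dict, the dotted names can be generated instead of typed out. This is a small sketch: dotted_params is our own helper, and it assumes deeper levels are addressed by adding more dots in the same way as the one-level example above.

```python
import json

def dotted_params(nested, prefix=""):
    """Flatten {"location": {"city": "Oakland"}} to {"location.city": "Oakland"}."""
    flat = {}
    for key, value in nested.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            # Descend one level and extend the dotted path
            flat.update(dotted_params(value, path))
        else:
            flat[path] = value
    return flat

query = {"key": "[YOUR KEY]", "merge": 0,
         "query": [{"dataset": "MXZHQGZSVH8484K",
                    "params": dotted_params({"location": {"city": "Oakland"}}),
                    "_limit": 3}]}
print(json.dumps(query))
```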

Caveats and Gotchas

If you’re using multidimensional querying to access GeoJSON information, remember that the API automatically paginates at 250 rows of data. You can read more about adjusting how many rows are returned at once in our API documentation.
