Thursday, September 1, 2011

Example #16 - Complex query using inline filter

Example #15 showed how to use a filtered view to do a query.  There is a second approach:  the query is in your code.  There is no filtered view stored at the data portal.  Your code creates a temporary filter and use it to get your results.  This is called an inline filter.

We'll start with an example to find all payments for consulting services made by Cook County in December 2010.  Our filter will search the Product Desritpion NIGP Code (sic) column for all rows equal to 918, the code for consulting services.

Get the IDs
You'll need the ID of the data set you wish to search.  This will be the originalViewID.  You'll also need the columnID for the column(s) in your query.  There are a couple ways to get that info.

Use your browser
You can quickly find these values using your web browser.  Open the Chkregist DEC2010 dataset in your browser, click on the Export button, and click on the API tab.  You'll find the originalViewID in the API Access Endpoint URL near the end of the string.  You can find the columnIDs a little farther down in the tab. Here's a screenshot.


Use code
Another way to get the ID for the column is to write code to dig them out of the view's metadata.  Here's a small program to print out the column names and column IDs for the Chkregist DEC2010 dataset on the Cook County Data Portal.

from urllib2 import urlopen
from json import load


url = 'http://datacatalog.cookcountyil.gov/api/views/e9qr-rmq4/rows.json'
u = urlopen(url)
response = load(u)


for columnMeta in response['meta']['view']['columns']:
  print columnMeta['id'], ' : ', columnMeta['name']


Here's the output.  For each row in the table, you'll see the column ID and the column name.  The first eight columns in the view are metadata.  They have an ID of -1.


From the browser, the originalViewID is e9qr-rmq4.  From the browser or the code, the Product Code columnID is 2700017.

Creating the query
I used the Socrata example about halfway down on this page as a guide to create the query.  There's very little documentation that I could find on how to write an inline filter.  This example seems to be the simplest possible.  It using a prefix operator with two operands to request rows that have the value 918 in the column holding the Product Codes.  Here's the Python snippet:

query      = {
             "originalViewId" : "e9qr-rmq4",
             "name" : "Inline Filter",
             "query" : {
                       "filterCondition" : {
                                           "type" : "operator",
                                           "value" : "EQUALS",
                                           "children" : [
                                                            { 
                                                            "columnId" : 2700017,
                                                            "type" : "column"
                                                            },
                                                            {
                                                            "type" : "literal",
                                                            "value" : "918"
                                                            }
                                                        ]
                                            }
                       }
             }

Example program with an inline filter
Here's a program to run the inline filter.  For each row with consulting services, it prints out a separator with the count along with the column name and their values.  This was lifted from a previous example.  Currently,
the filter produces 47 rows. Translation: the county wrote 47 checks in December 2010 for consulting services.

import httplib
import json
import pprint


hostName   = "datacatalog.cookcountyil.gov"
service    = "/views/INLINE/rows"
formatType = "json"
parameters = "method=index"
headers    = { "Content-type:" : "application/json" }
query      = {
             "originalViewId" : "e9qr-rmq4",
             "name" : "Inline Filter",
             "query" : {
                       "filterCondition" : {
                                           "type" : "operator",
                                           "value" : "EQUALS",
                                           "children" : [
                                                            { 
                                                            "columnId" : 2700017,
                                                            "type" : "column"
                                                            },
                                                            {
                                                            "type" : "literal",
                                                            "value" : "918"
                                                            }
                                                        ]
                                            }
                       }
             }


jsonQuery = json.dumps(query)
request = service + '.' + formatType + '?' + parameters 


conn = httplib.HTTPConnection(hostName)
conn.request("POST", request, jsonQuery, headers)
response = conn.getresponse()


if response.reason != 'OK':
print "There was an error detected."
print "Response status = %s.\n" % response.status
print "Response reason = %s.\n" % response.reason
raise SystemExit(1)


rawResponse = response.read()
jsonResponse = json.loads(rawResponse)


rowNum = 1
for rowData in jsonResponse['data']:
   print "\n======== %d =======" % (rowNum)
   colNum = 0
   for columnMeta in jsonResponse['meta']['view']['columns']:
      if columnMeta['id'] > 0:
         print columnMeta['name'], ' = ', rowData[colNum]
      colNum += 1
   rowNum += 1

Coding the filter
Asking the API to perform an inline filter is quite a bit different from the past API requests.  You must:
  • Use an HTTP POST.  All previous examples were HTTP GET.
  • Send an HTTP header with a specific key/value pair.
  • Send the query as a JSON or XML document 
urllib2.urlopen() doesn't give us the flexibility we need.  There's another library, httplib, that has a class, HTTPConnection(), which works nicely.  Details can be found in the Python documentation.

Comments on the code
  • json.dumps() does the work of turning query into its equivalent JSON format which the API will accept.
  • request is just the concatenation of the pieces that make up the API request.  I kept those components as individual variables to make it easier to change or have a fancier program modify.
  • HTTPConnection() sets up the connection to the portal.  This has to be in place before we make a request().
  • conn.request() sends the API request.  We tell it to use a POST.  Send the API request string we've built.  Send the JSON-formatted query.  Send the right headers.  Use the key/value pair shown.
  • conn.getresponse() provides a handle to the response from the API service.  We use that later to read() the response - get all the data sent back from the API.
  • response.reason is returned by the API server and will indicate if there were problems.
  • loads() turns the what the API sent back into JSON for easier processing.
  • the last section prints out the rows that matched the filter.  This code is almost identical to a script used in a previous example.
Output
Currently, 47 rows match the filter and are printed.  Here's the output of the first four:


Example program with a more complex filter
Here's the same program with a more complex query.  This time we'll add the condition that the payment needs to be greater than $10,000.  So, our filter is going to look for rows where Product Code = 918 and Payment > 10000.

Here's the filter.  The rest of the code is the same.

query      = {
             "originalViewId" : "e9qr-rmq4",
             "name" : "Inline Filter",
             "query" : {
                 "filterCondition" : {
                     "type" : "operator",
                     "value" : "AND",
    "children": [ {
                         "type" : "operator",
                         "value" : "EQUALS",
                         "children" : [ { 
                             "columnId" : 2700017,
                             "type" : "column"
                             }, {
                             "type" : "literal",
                             "value" : "918"
                             } ]
                         }, {
                         "type" : "operator",
                         "value" : "GREATER_THAN",
                         "children" : [ { 
                             "columnId" : 2700027,
                             "type" : "column"
                             }, {
                             "type" : "literal",
                             "value" : "10000"
                             } ]
                         } ]
                     }
                 }
             }

Output
Here are the first four rows that match.  There are total of 18 rows that match this filter.


Alter the filter with your code
The above examples all used a fixed inline filter.  You hard-coded the query and ran it.  Most apps will need to let the user set the query parameters.  Let's peek at how your app could adjust the query based on what the user enters.

You could have an app that asks the user for the payment threshold for consulting services.  Let's say they enter, $60,000.  You could then alter the filter to look for payments greater than that threshold and then send the filter in a request to the API.

These two lines demonstrate how you could make this work.  Place these two lines after query is defined and before jsonQuery is set.

paymentThreshold = "60000"
query['query']['filterCondition']['children'][1]['children'][1]['value'] = paymentThreshold


I've set the value for the paymentThreshold.  In an app, you would get that from the user.  The important line is the second one.  It shows how you can set a new value to use in the test for the payment amount.

Output
When I run that code, here's the output.  There are only five rows with payments over $60,000 for consulting services.


Compare to web UI filter
Something looks fishy here.  The first three rows are identical.  The last two are identical.  I dug into the view and those five rows are all in the table.  I also set up the same filter using the web UI to doublecheck.  Yep, those rows all appear in the data.  It's not the code going awry.

Here's a screenshot with the results of the same filter using the web browser.  The same five rows are displayed.



Summary
If your app will use the API to query data sets and you want to dynamically set the parameters of the filter, you will need to use an inline filter.  There were two examples of filters provided.  The first had a simple condition - find all rows with a specific product code.  The second example was a little more complicated - find all rows with a specific product code and had a payment amount above a given threshold.  Finally, a third example showed the line of code that could be used to adjust the threshold for the payment amount on the fly.

There's little documentation on inline filters.  The SODA site recommends creating a filter using the web UI and then looking at the query that was generated.  You can write a small program to get the filtered view you create.  Then, you look through it to find and study the query.  I did this for several to get a better understanding of the operators and structure of a filter.  It's not easy.  Making it worse, the UI-generated queries throw in seemingly superfluous operators and operands.  Without documentation, it's trial and error to get a filter working.

No comments:

Post a Comment