Reverse Engineering OKCupid

If you’ve ever seen bots on reddit like qkme_transcriber, you may wonder how they’re able to make posts without human intervention. Well, wonder no longer! By examining how OKCupid handles HTTP requests, I’ll show you how to deconstruct the undocumented APIs of other sites, then manipulate them with nary a mouse click.

Our weapons of choice will be Python and the debugger in the Chrome web browser. To access the Chrome Debugger, hit F12 or Ctrl+Shift+J. This tool gives you a wealth of information about the interactions between your browser and the sites you visit, but for this task we’re mainly concerned with what’s going on in the Network tab. If you’re more of a Firefox person, you can use an extension called Firebug. Incidentally, the lead developer of Firebug was hired by Google to develop the Chrome Debugger. For this post I’m going to assume you’re using Chrome.

Since our goal is to programmatically perform actions on the site, you need to make an OKC account (I recommend using a temporary email service like this one to avoid spam). The next step is to figure out how to log in using Python. So, turn on the Chrome Debugger, go to the Network tab, log out of OKC, and log in again. You’ll see a bunch of requests appear in the debugger, but the one we want is all the way at the top, so scroll up there and click the entry named “login”. That will bring up a bunch of helpful information about the request we made to OKC’s servers when we clicked “Sign in”.

[Screenshot: OKCupid login request in the Chrome Debugger. Caption: “Mmm, data”]

At the top, we can see the Request URL: https://www.okcupid.com/login. Also useful is the Request Method, POST. Under the Request Headers section, there’s a field called “Cookie” which has a bunch of random-looking stuff in it. This tells us that OKC sets a cookie on your browser, so we’ll need to make sure our code uses it. Finally, there’s another section called “Form Data” containing all of the information that our computer sent to the server when we signed in. Specifically, we can see our login credentials, “username” and “password”, along with something called “dest” that has the value “/home”. Looking at our browser’s URL bar, we can deduce that the “dest” parameter denotes the path to the resource in the okcupid domain to which our browser will be redirected after the login request is completed.
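To make the “Form Data” section a little more concrete, here is a minimal sketch of how those three fields would be encoded in the body of the POST request. It assumes the login form uses the default application/x-www-form-urlencoded encoding (the debugger screenshot doesn’t prove that), and the credentials are placeholders. Only the standard library is used, so no request is actually sent.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder credentials; 'dest' is the post-login redirect path seen in the debugger.
form_fields = {
    'username': 'okc_account_name',
    'password': 'account_pword',
    'dest': '/home',
}

# Assumption: the form is URL-encoded, i.e. key=value pairs joined by '&',
# with reserved characters such as '/' percent-escaped.
encoded_body = urlencode(form_fields)
print(encoded_body)

# Decoding the body recovers the fields the "Form Data" panel displays.
decoded = parse_qs(encoded_body)
print(decoded['dest'])
```

Passing a dictionary as the data argument of a Requests POST call performs this same form-encoding for you, which is why you never have to build the string by hand.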

So how do we log in using Python? It’s easy using the Requests library:

import requests

sesh = requests.Session()

credentials = {'username': 'okc_account_name', 'password': 'account_pword', 'dest': '/home'}
login_resp = sesh.post('https://www.okcupid.com/login', data=credentials)

print(login_resp.content)

We start by instantiating a Session object, which will store the cookies necessary for making future requests to OKC that require authentication. Then, we create a dictionary containing the POST data needed for the login request (fill in your own username and password). Finally, we make the POST request from within the Session, passing in the login URL and the credentials. If you save login_resp.content to an HTML file and open it in your browser, you’ll see the OKC welcome page for your account. Success!
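Writing the response body to disk could look something like the sketch below. The helper name save_html and the file name okc_home.html are my own inventions, and a short byte string stands in for login_resp.content so the snippet runs without touching the network.

```python
import tempfile
from pathlib import Path

def save_html(content, path):
    """Write a response body to disk; accepts bytes (resp.content) or str (resp.text)."""
    target = Path(path)
    if isinstance(content, bytes):
        target.write_bytes(content)
    else:
        target.write_text(content, encoding='utf-8')
    return target

# Stand-in for login_resp.content (hypothetical markup, not OKC's real page).
sample_body = b'<html><body>Welcome!</body></html>'

# Hypothetical output location in the system temp directory.
saved = save_html(sample_body, Path(tempfile.gettempdir()) / 'okc_home.html')
print(saved)
```

With a real session you would call save_html(login_resp.content, 'okc_home.html') and open the resulting file in your browser.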

Now let’s try something a bit trickier: sending a message. Go back to OKCupid with the debugger open and click the Messages tab on the left toolbar. In the Network section of the debugger there will be a request called “messages”. If you look at the Headers of that request, you’ll see that the Request URL is http://www.okcupid.com/messages and the Request Method is GET. Assuming you’ve run the login code above, we can replicate this in Python as follows:

get_messages = sesh.get('http://www.okcupid.com/messages')

If you haven’t deleted it, there should be a message from the user “chriscoyne” in your Received Messages. Since that account apparently doesn’t respond to messages, it’s a good test subject for cracking the OKC API. If you click on his message, you’ll get another request in your debugger that looks like this:

[Screenshot: the message thread request in the Chrome Debugger. Caption: “Hi Chris!”]

The text following the question mark after “messages” is called a query string, and it allows you to send additional information to an HTTP server. We can set these values in Python in a way similar to our POST request above:

thread_data = {'readmsg': 'true', 'threadid': 7054982370615456828, 'folder': 1}
get_thread = sesh.get('http://www.okcupid.com/messages', params=thread_data)
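If you’d rather not retype query parameters by hand, a small helper can split a URL copied from the debugger into the base URL and a parameter dictionary. This is just a sketch: split_request_url is a hypothetical name, and the URL below is reconstructed from the parameters in the example above rather than copied from a live session.

```python
from urllib.parse import urlsplit, parse_qsl

def split_request_url(url):
    """Return (base_url, params) for a URL that may carry a query string."""
    parts = urlsplit(url)
    base = f'{parts.scheme}://{parts.netloc}{parts.path}'
    # parse_qsl yields (key, value) pairs; all values come back as strings.
    params = dict(parse_qsl(parts.query))
    return base, params

# Reconstructed example URL (assumption: this is how the request appears in the debugger).
debugger_url = 'http://www.okcupid.com/messages?readmsg=true&threadid=7054982370615456828&folder=1'

base_url, params = split_request_url(debugger_url)
print(base_url)
print(params)
```

The two results can then be handed straight to the session, as in sesh.get(base_url, params=params). Note that the values are strings here rather than integers, which makes no difference once they are encoded back into a URL.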

If you want to avoid manually determining the thread IDs for OKCupid messages, there’s a great HTML parsing library called Beautiful Soup. I’ll talk more about scraping web pages in a future post.
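In the meantime, here is a rough sketch of the idea using Python’s built-in html.parser instead of Beautiful Soup, so it runs without installing anything. Big caveat: I’m assuming the messages page links to each thread with an href containing a threadid query parameter, and the HTML below is a made-up fragment, not OKC’s actual markup. ThreadIdCollector is a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs

class ThreadIdCollector(HTMLParser):
    """Collect unique 'threadid' values from the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.thread_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        query = parse_qs(urlsplit(href).query)
        for thread_id in query.get('threadid', []):
            if thread_id not in self.thread_ids:
                self.thread_ids.append(thread_id)

# Hypothetical markup standing in for the body of the messages page.
sample_html = '''
<ul>
  <li><a href="/messages?readmsg=true&amp;threadid=7054982370615456828&amp;folder=1">chriscoyne</a></li>
  <li><a href="/messages?readmsg=true&amp;threadid=1234567890&amp;folder=1">someone_else</a></li>
  <li><a href="/profile/chriscoyne">profile</a></li>
</ul>
'''

collector = ThreadIdCollector()
collector.feed(sample_html)
print(collector.thread_ids)
```

Against the real site you would feed it the text of the messages response and adjust the matching logic to whatever the page actually contains.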

Now, send a message to chriscoyne and check out what the debugger gives you.

[Screenshot: the send-message request in the Chrome Debugger. Caption: “I swear”]

So, we have to make another POST request. But you know what to do!

msg_data = {
    'ajax': 1, 'sendmsg': 1, 'r1': 'chriscoyne',
    'body': 'messager? but i hardly know her!',
    'threadid': 7054982370615456828,
    'authcode': '1,0,1362399999,0xdb8d1bb0ffffffff;aa5242512122943168ddffffffffffffffffffff',
    'reply': 1,
}
send_msg = sesh.post('http://www.okcupid.com/mailbox', data=msg_data)

If you go back to the message thread in your browser, you’ll see the message you just sent from your script. To avoid having to send a test message just to find the value for authcode, you can right-click on the message box in Chrome and select “Inspect element” from the drop-down menu. You should see the authcode a few HTML elements above the element for the message box:

[Screenshot: the authcode element in Chrome’s element inspector. Caption: “Not so hidden eh?”]

OKC seems to change the authcode at certain intervals as a security measure, so in a production system you’d need to dynamically grab the value before you send a message (which you can do with Beautiful Soup).
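Grabbing the authcode dynamically might look something like this. It’s a sketch under assumptions: I’m guessing the authcode lives in an input element named authcode whose value attribute holds the code, and the HTML is a made-up fragment rather than OKC’s real page. I’ve also swapped Beautiful Soup for the standard library’s html.parser so the example has no dependencies; AuthcodeFinder and find_authcode are hypothetical names.

```python
from html.parser import HTMLParser

class AuthcodeFinder(HTMLParser):
    """Remember the value of the first <input name="authcode"> encountered."""

    def __init__(self):
        super().__init__()
        self.authcode = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'input' and attributes.get('name') == 'authcode' and self.authcode is None:
            self.authcode = attributes.get('value')

def find_authcode(html):
    """Return the authcode found in the page, or None if there isn't one."""
    finder = AuthcodeFinder()
    finder.feed(html)
    return finder.authcode

# Hypothetical markup; the real element layout may differ.
sample_html = '''
<form action="/mailbox" method="post">
  <input type="hidden" name="authcode" value="1,0,1362399999,0xdb8d1bb0ffffffff;aa5242512122943168ddffffffffffffffffffff">
  <textarea name="body"></textarea>
</form>
'''

print(find_authcode(sample_html))
```

In a real script you would fetch the thread page with the session first, pass its text to find_authcode, and drop the result into the authcode field of the message data before posting.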

So there you have it! Hopefully this post has helped you get started reverse engineering other website APIs. If you need help, don’t hesitate to leave a comment.

18 thoughts on “Reverse Engineering OKCupid”

  1. Please consider using lxml instead of Beautiful Soup. lxml uses the Python-standard etree API and has lots of other goodies like xpath which is a huge boon to scraping webpages.

  2. I’m a fan of burp and pyquery for doing a quick job of it. Burp helps map the entire api way faster and pyquery searching is really handy.

  3. I second Linus’s point about using lxml over Beautiful Soup.

    Beautiful Soup hasn’t been updated for quite some time; it’s an abandoned project. I would love to see this blog article updated with lxml.

    • You call 6 months since the last release “quite some time” and “abandoned”? What interesting definitions you have.

  4. I third Dennis and Linus’ recommendation to use lxml. If you can’t get away with straight up regex, lxml will save you some CPU cycles.

  5. It’s an honor to be arguing with Linus himself :)
    I’ve used both BeautifulSoup and lxml, and I’d say that while lxml is probably the preferred solution in production with its speed, clean interface, and built-in xpath support, BeautifulSoup can still be useful for a quick hack or dealing with severely broken HTML.
