Thursday, November 24, 2016

Think Negative

I was reading Artillery Through the Ages: A Short Illustrated History of Cannon, Emphasizing Types Used in America by Albert Manucy, when I came across the following passage.

There is one apocryphal tale, however, about an experiment with chain shot as anti-personnel missiles: instead of charging a single cannon with the two balls, two guns were used, side by side. The ball in one gun was chained to the ball in the other. The projectiles were to fly forth, stretching the long chain between them, mowing down a sizeable segment of the enemy. Instead, the chain wrapped the gun crews in a murderous embrace; one gun had fired late.

Whether the story is true or not, it teaches an important lesson. When designing a system, don't just think about the happy path. Make sure you think about all of the ways that things could go wrong, or risk being wrapped in the "murderous embrace" of your own design. (This is also known as being "hoist by one's own petard," a rather antiquated phrase, also explained in colorful detail in Artillery Through the Ages.)

Monday, August 22, 2016

Twitter's Favorite Films

If you were on Twitter at all last week, you probably couldn't help but notice a flurry of "Fav7" hashtags trending, including #Fav7Films, #Fav7Books, and #Fav7TVShows, where people were posting lists of their seven favorite things from each category.



I thought it would be fun to scrape the data to see what Twitter's favorite films are, and compare it to the top rated films on IMDb and Rotten Tomatoes. Here are the results.

Twitter's Top 25 Films
  1. The Dark Knight (9.0 IMDb, 94% RT)
  2. Pulp Fiction (8.9 IMDb, 94% RT)
  3. The Empire Strikes Back (8.8 IMDb, 94% RT)
  4. Goodfellas (8.7 IMDb, 96% RT)
  5. The Shawshank Redemption (9.3 IMDb, 91% RT)
  6. Fight Club (8.8 IMDb, 79% RT)
  7. The Godfather (9.2 IMDb, 99% RT)
  8. Back to the Future (8.5 IMDb, 96% RT)
  9. Inception (8.8 IMDb, 86% RT)
  10. Jurassic Park (8.1 IMDb, 93% RT)
  11. Forrest Gump (8.8 IMDb, 72% RT)
  12. The Big Lebowski (8.2 IMDb, 81% RT)
  13. Jaws (8.0 IMDb, 97% RT)
  14. Star Wars (8.7 IMDb, 93% RT)
  15. Raiders of the Lost Ark (8.5 IMDb, 94% RT)
  16. The Princess Bride (8.1 IMDb, 97% RT)
  17. Blade Runner (8.2 IMDb, 89% RT)
  18. Alien (8.5 IMDb, 97% RT)
  19. The Departed (8.5 IMDb, 91% RT)
  20. The Matrix (8.7 IMDb, 87% RT)
  21. Interstellar (8.6 IMDb, 71% RT)
  22. Aliens (8.4 IMDb, 98% RT)
  23. Good Will Hunting (8.3 IMDb, 97% RT)
  24. The Shining (8.4 IMDb, 88% RT)
  25. Die Hard (8.2 IMDb, 92% RT)

A few observations:

  • Fewer than half of the films in Twitter's top 25 are also in IMDb's top 25.
  • The Godfather (1972) is the oldest film on the list, while Interstellar (2014) is the newest.
  • Harrison Ford has starred in the most (4) of the top 25 films, while Steven Spielberg has directed the most (3).
  • Only three sequels appear in the top 25. For two of those, the original film also appears in the top 25.
  • Action/adventure and science fiction films dominate the list.
  • As popular as they are right now, only one film based on a comic book character is in the top 25 (although it did take the top spot).

The source code

Start by loading the required libraries, including twitteR for accessing the Twitter API, and setting up authentication. You'll need to sign up for free on Twitter Developers to get your own authentication keys and tokens. (If you've never done this before, see Bogdan Rau's Collecting Tweets Using R and the Twitter Search API for a more detailed guide.)

library(dplyr)
library(purrr)
library(twitteR)

# Download cacert file for Windows use.
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

consumer_key <- 'your key'
consumer_secret <- 'your secret'
access_token <- 'your access token'
access_secret <- 'your access secret'
setup_twitter_oauth(consumer_key,
                    consumer_secret,
                    access_token,
                    access_secret)

Next, query the Twitter search API for the "#Fav7Films" hashtag, and initialize a data frame with tweets.

requests <- 1 # keep count of how many requests are sent
num_tweets <- 3000 # number of tweets to fetch per request
delay <- 62.0 # add in a delay so the API doesn't block

fav_film_tweets <- searchTwitter("#Fav7Films\n", n=num_tweets)
Sys.sleep(delay) # be nice to the API
fav_film_df <- tbl_df(map_df(fav_film_tweets, as.data.frame))

fav_film_all <- fav_film_df[fav_film_df$isRetweet == FALSE, ]

Now we want to keep searching in a loop, until we've downloaded all the tweets we're interested in. To do that, we'll keep looping as long as the API returns as many tweets as we told it to. Once it returns fewer tweets, we know it ran out.

while(nrow(fav_film_df) == num_tweets) {
    max_id <- fav_film_df$id[num_tweets]
    requests <- requests + 1
    fav_film_tweets <- searchTwitter("#Fav7Films\n", n=num_tweets, maxID=max_id)
    fav_film_df <- tbl_df(map_df(fav_film_tweets, as.data.frame))
    fav_film_all <- rbind(fav_film_all, fav_film_df[fav_film_df$isRetweet == FALSE, ])

    Sys.sleep(delay) # be nice to the API
}

Note that I added the maxID=max_id parameter to the request. This tells the search API to return tweets older than the previous set of tweets. Also note that I added a delay in the loop. Twitter has set a rate limit on their search API to 15 requests every 15 minutes, so this delay is to avoid being blocked.
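
The paging logic can also be sketched in Python without touching the network. This is a minimal illustration, not what twitteR does internally: fake_search is a hypothetical stand-in for the search API, and it treats max_id as inclusive (the tweet with exactly that ID comes back again), which is how Twitter's documentation describes the parameter. Because of that, the sketch skips IDs it has already seen.

```python
def fake_search(all_ids, n, max_id=None):
    """Hypothetical stand-in for a search API: newest first, max_id inclusive."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:n]


def fetch_all(all_ids, page_size):
    """Keep requesting pages until one comes back short."""
    collected = []
    seen = set()
    max_id = None
    while True:
        page = fake_search(all_ids, page_size, max_id)
        new_ids = [i for i in page if i not in seen]  # boundary tweet repeats
        seen.update(new_ids)
        collected.extend(new_ids)
        # Stop on a short page, or if a page adds nothing new.
        if len(page) < page_size or not new_ids:
            break
        max_id = page[-1]  # oldest ID in this page
    return collected


tweet_ids = list(range(1, 11))  # ten made-up tweet IDs
result = fetch_all(tweet_ids, page_size=4)
print(result)
```

In the R code, the repeated boundary tweet is harmless because duplicate users are removed in a later step.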

That will take a while, but once it's done we'll have over 100,000 tweets, so we want to save them so we don't have to go through all that again. I just saved the whole data frame to an R data blob.

save(fav_film_all, file="Fav7FilmTweets.Rda")

You can download that file from GitHub at Fav7FilmTweets.Rda if you want to follow along from this point, or if you want to do your own analysis on this data set. Just use load("Fav7FilmTweets.Rda") to load the data frame from the file.

Next, we want to remove any retweets or multiple tweets from the same user.

fav_film_all <- fav_film_all[fav_film_all$isRetweet == FALSE, ]
fav_film_all <- fav_film_all[!duplicated(fav_film_all$screenName), ]

Now we can start parsing the lists of film titles from the tweets. Most people formatted their titles on separate lines, so we'll assume that format. Any tweets that don't use that format will just fall to the bottom of the list of films once we rank them.

# remove the hashtag, ignoring case
fav_film_all$text <- gsub("#fav7films", "", fav_film_all$text, ignore.case=TRUE)

# remove list markers ("1.", ")", "-"); note that this also strips hyphens inside titles
fav_film_all$text <- gsub("\\d\\.|\\)|-", "", fav_film_all$text)

# convert to common case for all tweets
fav_film_all$text <- tolower(fav_film_all$text)

# trim any whitespace left over from earlier steps
fav_film_all$text <- trimws(fav_film_all$text)

At this point, we should have a bunch of lists of seven movie titles. What we want to do next is separate them all out into one large list of titles, count how many times each title appears, then sort the list. We'll also remove "A" and "The" from the beginning of any titles that include them, since many people included them, but many didn't.

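# split each tweet into lines, then flatten into one vector of titles
titles <- unlist(strsplit(fav_film_all$text, split="\n"))

# trim each title, since removing list numbers can leave leading spaces
titles <- trimws(titles)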

# remove leading 'a' and 'the' from titles
titles <- gsub("^a ", "", titles)
titles <- gsub("^the ", "", titles)

# remove empty titles
titles <- titles[titles != ""]

ranked_titles <- sort(table(titles), decreasing=TRUE)
top_25 <- head(ranked_titles, 25)

That's the final list. There are a lot of other conditioning steps that we could have taken, like looking for common abbreviations or misspellings, but I think this gets us pretty close to an accurate list.
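
For readers who don't use R, the same counting idea can be sketched in Python. This is only an illustration under a few assumptions: the tweets below are made up, extract_titles is a hypothetical helper, and the list-marker pattern is anchored to the start of each line, so it is a little stricter than the R pattern above.

```python
import re
from collections import Counter

# Made-up tweets for illustration only; not taken from the real data set.
tweets = [
    "#Fav7Films\n1. The Godfather\n2. Jaws\n3. Alien",
    "Jaws\nthe godfather\nA Few Good Men #fav7films",
    "#FAV7FILMS\n1) Jaws\n2) Heat",
]


def extract_titles(text):
    """Turn one tweet into a list of normalized film titles."""
    text = re.sub(r"#fav7films", "", text, flags=re.IGNORECASE)
    titles = []
    for line in text.lower().split("\n"):
        line = re.sub(r"^\s*\d+[.)]\s*", "", line)     # leading "1." or "1)"
        line = re.sub(r"^(a|the) ", "", line.strip())  # leading article
        if line:
            titles.append(line)
    return titles


# Flatten every tweet's titles into one stream and count them.
counts = Counter(t for tweet in tweets for t in extract_titles(tweet))
top = counts.most_common(2)
print(top)
```

Counter.most_common plays the role of sort(table(titles), decreasing=TRUE) followed by head.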

You can view the full R source code that I used to gather and analyze tweets for this project in my Fav7 GitHub repository. Feel free to fork that and use it to analyze other Twitter favorites, and leave me a comment if you do, or if you have any questions.

Monday, August 15, 2016

How to Ask a Question on Stack Overflow, a Minimal Guide

"How do I ask a question on Stack Overflow without having it immediately downvoted and closed?" This question is frequently asked by new users on Meta Stack Overflow, and people still seem to have issues with the official guidelines, so maybe a shorter, to-the-point guide will help.

Assuming your question is on-topic, there are three things you need to include when posting a question on Stack Overflow:

1. What your program is supposed to do
2. Your code
3. What your program is actually doing

If you leave any of those three things out, people will have trouble answering your question, so they'll often downvote it and vote to close it instead of trying to answer it. (OK, I admit it. It's me. I'm the one downvoting you.) Let's take a closer look at each one of these things, and maybe I can convince you why each one is important.

What your program is supposed to do

You need to include an explanation of what you are trying to do in your question. Without it, people are not going to be able to start helping you. Don't just assume that people will understand from the title or context what you're trying to do. Spell it out. If you help people understand your problem, they're much more able to help you with a solution.

Your code

Does every question on Stack Overflow require code? No. But the majority of them do. Most often, people are going to need to be able to run your code in order to help you. If you have code, you need to post it. If you're going to write code and you don't know where to start, you need to start before asking about it on Stack Overflow. Don't post a link to your code. Don't post an image of your code. Post your code.

What your program is actually doing

Describe how your program is acting differently than you expect (see "What your program is supposed to do" above). Not including this information leaves people guessing about what the problem actually is. Does your code not even compile? Give the wrong output? Print an error message? Lock up your machine and set fire to the building? Each one of these very different behaviors will point an experienced developer to the solution much more quickly than just reading your code. Include any output, including the text of any error messages, in your question. Those cryptic error messages might not mean anything to you, but a more experienced programmer can lead you directly to the problem if they know all of the details.

Is that really all I need?

No, not really, but this covers the majority of the questions that I see closed and downvoted on Stack Overflow. If you follow these guidelines, you should have far fewer problems with getting your questions answered. Once you've mastered just asking a question and not getting it closed, you should read some more complete guides on asking great questions, like Jon Skeet's Writing the Perfect Question.


Wednesday, January 29, 2014

Why is [programming language] so...?


I saw an infographic recently called Why is [state] so... and thought it would be interesting to do the same for programming languages. Unfortunately, there's no convenient visualization (like a map) for programming languages, so screenshots will have to do. Here are a few results for the languages that Google actually recognizes as programming languages.

[Screenshots of the Google results followed here, one per language: ASP.NET, C++, C#, Cobol, Erlang, Fortran, Haskell, Java, JavaScript, Lisp, Objective-C, Perl, PHP, Python, Ruby, Smalltalk, VB.NET.]

As a long-time Java programmer, I'm just pleasantly surprised that "popular" was the top result instead of "slow."

Thursday, December 26, 2013

Creating a Twitter 'Bot on Google App Engine in Python

I've been running a Twitter 'bot from my laptop using Windows Task Scheduler for the past several months, and finally decided that it's time to upload it to a server to run from the cloud. @BountyBot tweets new and interesting bounty questions from Stack Overflow several times per day. By following the instructions below, you can set up your own Twitter 'bot that runs on Google App Engine.

What you'll need to get started:
  1. Python 2.7
  2. A Google App Engine Account
  3. The Google App Engine SDK
  4. A Twitter account (with authentication credentials)
  5. A Python library for accessing the Twitter API
  6. A source of information to tweet about

What is Google App Engine?

Google App Engine is Google's cloud computing platform. It allows you to create web applications and run them on Google's existing infrastructure. GAE supports web applications written in Python, Java, PHP, and the Go programming language. You can find a lot more information on the Google App Engine page, or keep reading for a quick guide to getting set up and posting a Python app on Google App Engine.

Setting up a Google App Engine Account

You can set up a Google App Engine Account for free. You're only charged for the resources that your application uses (that is, if you even surpass the free service quotas), so it can be a very good low-cost alternative to traditional web hosting providers that charge a flat monthly rate, particularly for a low-resource application like a Twitter 'bot.

Go to https://accounts.google.com to sign in to the GAE dashboard with your Google account credentials. From there you can create a new application. You'll need to provide an application identifier that will be used in your application's configuration later.

Installing Python 2.7

If you don't already have Python installed, you can just download version 2.7 from the official Python download page. If you already have Python on your machine, the Google App Engine SDK installer (see next section) will check for the correct version for you. If it reports an error, you may need to download the correct version of Python, or re-install Python 2.7 in the default location for your operating system.

Installing the Google App Engine SDK

Google App Engine has its own software development kit (SDK) available for free download that allows you to quickly get started developing your app. Choose Python from the Downloads page, then download and run the installer for your operating system. This should place a shortcut on your desktop to a program called the Google App Engine Launcher.

Deploying and Testing

Before we get to the Twitter 'bot, let's create a quick test page and deploy it to Google App Engine to make sure everything we've done so far is working. The Google App Engine Hello, World! documentation already shows how to create, test, and deploy an application from the command line. I'm going to show how to do the same simple application from the Google App Engine Launcher desktop program that we downloaded and installed in the last section.

Launch the Google App Engine Launcher program and choose Create New Application... from the File menu.


You can provide a new name, or the same name you provided as an Application Identifier when you created your Google App Engine account earlier. Also provide a parent directory on your system for the project files to be stored, then click the Create button. If you go to the project directory, you'll see that several files were created for you.

  • app.yaml - The configuration file that maps URLs to handler scripts. This file also contains your unique application identifier and a version number that allows you to roll your app back to specific versions from the GAE admin console.
  • favicon.ico - The icon that will be displayed in browser tabs when your app is viewed. The Google App Engine icon is the default.
  • index.yaml - A configuration file that specifies which indexes your app uses in the App Engine datastore. Not used in this application.
  • main.py - A Python script that handles requests using the webapp2 framework.

Select the newly created application in GAE Launcher and click the Run button. If you open the log console you'll be able to see what commands are run. When the application is running, you'll be able to go to http://localhost:8080/ in your browser and see the program's output. (If instead of a running app you get an error message at this point, you may need to go to Edit > Preferences and set the correct Python path.)

Next, click the Deploy button in GAE Launcher. You'll be prompted for your email address and password, then all of the application files will be uploaded to Google App Engine. Once this is done, you can visit your application's public URL to view the program's output again. Congratulations, the app is now online!

Getting your Twitter Account Authentication Credentials

Before you can post tweets from your Google App Engine project, you'll need to set up some authentication credentials with your Twitter account. Sign in to the Twitter Developers page, choose My applications from the menu at the top right, then create a new application. You'll need to change the application type to Read and Write on the settings tab in order to give the new application access to post tweets to your Twitter account.

You'll need the Consumer key and Consumer secret from the OAuth settings section, and you'll need to create an Access token and Access token secret. Be careful! You want to keep these values secret so that other people can't use them to post status updates to your Twitter account. I keep them in a separate properties file so that they stay out of my source code, and don't accidentally get published where people can access them. You'll see how these values are used in a later section, when we look at the BountyBot code.
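
As a rough sketch of the properties-file idea, the snippet below writes a sample settings file and reads it back. The section and key names match the ones BountyBot's tweet function reads; the values are placeholders, load_credentials is a hypothetical helper, and the temporary file exists only so the example runs on its own. The import fallback lets it run on Python 2.7 or Python 3.

```python
import os
import tempfile

try:
    import configparser  # Python 3
except ImportError:
    import ConfigParser as configparser  # Python 2.7

# Placeholder values only; real keys belong in a file kept out of version control.
SAMPLE_SETTINGS = """[Twitter OAuth]
CONSUMER_KEY = your-consumer-key
CONSUMER_SECRET = your-consumer-secret
ACCESS_TOKEN_KEY = your-access-token
ACCESS_TOKEN_SECRET = your-access-token-secret
"""


def load_credentials(path):
    """Read the four OAuth values from the [Twitter OAuth] section."""
    config = configparser.RawConfigParser()
    config.read(path)
    names = ["CONSUMER_KEY", "CONSUMER_SECRET",
             "ACCESS_TOKEN_KEY", "ACCESS_TOKEN_SECRET"]
    return dict((name, config.get("Twitter OAuth", name)) for name in names)


# Write the sample to a temporary file so the sketch is self-contained.
handle, settings_path = tempfile.mkstemp(suffix=".cfg")
with os.fdopen(handle, "w") as f:
    f.write(SAMPLE_SETTINGS)

credentials = load_credentials(settings_path)
os.remove(settings_path)
print(sorted(credentials))
```

If the project lives in a public repository, the real settings file should also be listed in the repository's ignore file.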

Tweeting in Python

You'll need a Twitter API wrapper in order to post tweets in Python. I used tweepy when creating BountyBot because it makes posting a tweet as simple as possible. Once a status message is composed, posting it on Twitter can be done in four lines of code using tweepy, and three of those are for authentication. It doesn't get much simpler. Conveniently, tweepy is also compatible with Python 2.7.

You can download tweepy by following the instructions on the GitHub project linked above. Since it needs to be uploaded to Google App Engine in order to be used by the web app, I just copied the entire tweepy directory into the project directory for my GAE application.

What does a Twitter 'bot tweet about?

Even if it's just for your own personal amusement, you're going to want to give your 'bot something interesting to tweet about. Fortunately, there are a lot of 'bots already on Twitter to look to for inspiration. There are 'bots that tweet weather updates, breaking news headlines, stock quotes, the price of Bitcoin, and other seemingly random facts. You're really only limited by your imagination and Twitter's 140-character post limit. Check out sites like ProgrammableWeb, Data.gov, and World Bank for thousands of data sets and APIs to use.

Stack Overflow and Bounties

The Twitter 'bot I'm going to use for this demonstration gets its information from Stack Overflow using the Stack Exchange API. Stack Overflow is a question and answer site for programmers. Professional programmers and students post questions about code that they're writing for other programmers in the community to answer. The best answers get voted up, earning reputation points for the person who posted the answer. If a question doesn't get a good answer for some time, a "bounty" of bonus reputation can be placed on the question by anyone who wants to get an answer (provided they have the extra reputation to spend on a bounty).

Bounties last for seven days, unless the person who placed the bounty awards it early. You can view all of the questions that have open bounties on the featured questions tab. Since Stack Overflow is a very active site (thousands of questions are posted every day), it sees about 60 bounties posted per day on average. This is why it's convenient to have a Twitter 'bot that posts links to only the most interesting bounty questions (as determined by the amount of the bounty and the number of upvotes the question receives).

All of the questions and answers posted on Stack Overflow are accessible through the Stack Exchange API, including a method for returning information about questions with active bounties. The Python code we'll look at in the next section will call this API method to get all of the bounties posted in the past 8 hours.
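
Because a bounty lasts seven days, a bounty posted within the past 8 hours is one that expires between 7 days minus 8 hours and 7 days from now. The sketch below computes that window. It is only an illustration of the arithmetic: expiry_window is a hypothetical helper, and how the two times are passed to the Stack Exchange API is not shown.

```python
from datetime import datetime, timedelta

BOUNTY_LIFETIME = timedelta(days=7)  # bounties last seven days
LOOKBACK = timedelta(hours=8)        # how far back to look for new bounties


def expiry_window(now):
    """Return (start, end) expiry times for bounties posted in the last 8 hours."""
    end = now + BOUNTY_LIFETIME
    start = end - LOOKBACK
    return start, end


now = datetime(2013, 12, 26, 13, 0, 0)  # example time, chosen arbitrarily
start, end = expiry_window(now)
print(start, end)
```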

Putting it all together

Now that all the pieces are in place, we can see how they all fit together. You can take a look at the full code for BountyBot on GitHub, and I'll explain several key points here.

The tweet_bounty.py file contains all of the updated code for BountyBot to run on Google App Engine. It follows the same basic structure as the "Hello, World!" example that we looked at earlier. The script contains a class named TweetBounty that extends the webapp2.RequestHandler class. The get method of this class is configured to handle requests.

The get method queries the Stack Exchange API for the most recently posted bounties, finds the most interesting bounty in that list, formats it into a 140-character (maximum) message, then posts that message as a status update to Twitter.

  • request_bounties - Requests a list of bounty questions from the Stack Exchange API. The most recent bounties are those that will expire in one week, so the time stamps passed to this method form an eight-hour window that ends one week from the current time and date.
  • find_max - Loops through the list of bountied questions and returns the one with the highest bounty amount. Upvotes on the questions are used to break ties.
  • format_status_msg - Takes the maximum bounty question and formats it into a 140-character message for posting to Twitter. (Question title, short link to the question, bounty amount, and most relevant tags that will fit in the 140-character limit.)
  • tweet - Takes the formatted status message and posts it to the Twitter account whose authentication credentials are supplied in the settings.cfg file.
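
To make the selection and formatting steps concrete, here is a minimal sketch of what find_max and format_status_msg might look like. It is not BountyBot's actual code: the question data is made up, the dictionary field names are assumptions, and the length check counts raw characters without accounting for any link shortening that Twitter applies.

```python
# Made-up questions for illustration; the field names are assumptions.
questions = [
    {"title": "Why is my loop slow?", "bounty": 100, "score": 3,
     "link": "http://example.com/q/1", "tags": ["python", "performance"]},
    {"title": "How do I parse this file?", "bounty": 250, "score": 1,
     "link": "http://example.com/q/2", "tags": ["java", "parsing", "io"]},
    {"title": "What does this error mean?", "bounty": 250, "score": 7,
     "link": "http://example.com/q/3", "tags": ["c++", "templates"]},
]

MAX_LENGTH = 140  # Twitter's post limit at the time


def find_max(items):
    """Highest bounty wins; upvotes (score) break ties."""
    return max(items, key=lambda q: (q["bounty"], q["score"]))


def format_status_msg(question):
    """Build "title link (+bounty) #tags", adding tags only while they fit."""
    suffix = " %s (+%d)" % (question["link"], question["bounty"])
    room = MAX_LENGTH - len(suffix)
    title = question["title"]
    if len(title) > room:
        title = title[:room - 3] + "..."  # shorten long titles
    msg = title + suffix
    for tag in question["tags"]:
        hashtag = " #" + tag
        if len(msg) + len(hashtag) > MAX_LENGTH:
            break  # no room left for further tags
        msg += hashtag
    return msg


best = find_max(questions)
status = format_status_msg(best)
print(status)
```

Comparing (bounty, score) tuples gives the tie-breaking rule for free, since tuples compare element by element.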

The tweet function is where the magic happens, so it's worth taking a closer look at it here.

import ConfigParser  # Python 2.7 standard library
import tweepy

# Update the Twitter account authorized
# in settings.cfg with a status message.
def tweet(status):
    config = ConfigParser.RawConfigParser()
    config.read('settings.cfg')
    
    # http://dev.twitter.com/apps/myappid
    CONSUMER_KEY = config.get('Twitter OAuth', 'CONSUMER_KEY')
    CONSUMER_SECRET = config.get('Twitter OAuth', 'CONSUMER_SECRET')
    # http://dev.twitter.com/apps/myappid/my_token
    ACCESS_TOKEN_KEY = config.get('Twitter OAuth', 'ACCESS_TOKEN_KEY')
    ACCESS_TOKEN_SECRET = config.get('Twitter OAuth', 'ACCESS_TOKEN_SECRET')

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)
    result = api.update_status(status)

The tweet function takes in status as an argument and posts it to Twitter. The first two lines of the function read a configuration file that contains the Twitter authentication credentials we set up earlier, and the next four lines read those values from the file. If we were going to reuse these values it would be worth it to pull this part of the code out into a separate function, but since this script only accesses Twitter once, we can do it all in one place.

The last four lines use the authentication credentials loaded from the file to prove to Twitter that the script has permission to post status messages on the associated Twitter account, then update the status of that account with the status message passed in as an argument. That's all there is to it.

Scheduling tweets with cron

Now that we've got a web app up and running that can post to Twitter, all that's left is to set up a schedule for when those tweets should be posted using cron. You do this on Google App Engine by creating a cron.yaml file that specifies when you want tasks to be executed. Tasks for BountyBot in cron.yaml have the following format:

- description: daily 1PM tweet
  url: /tweet_bounty
  schedule: every day 13:00
  timezone: America/New_York

The first line is just a description and doesn't change the way cron behaves. The second line is the URL of the script that needs to be run on a schedule. It's a relative URL from the base of the application on GAE. The next line tells cron when to run the script, and the last one specifies what timezone that schedule is based in. If you leave out the timezone, GAE will assume UTC. Since I want BountyBot to post tweets three times per day, I have three entries in my cron.yaml file, one each for 5AM, 1PM, and 9PM in my timezone.

Important: Since scheduled tasks usually do things that you only want to be done on a schedule and not by users visiting the URL of the script (like posting status updates to your Twitter account, for one example), it's important to secure those scripts so that they can only be run by a site administrator (you) and the task scheduler (cron). You can secure a script by adding login: admin to its entry in your app.yaml file.

- url: /tweet_bounty
  script: tweet_bounty.app
  login: admin

You can read more about formatting tasks in cron.yaml in the GAE article Scheduled Tasks With Cron for Python.

Finally...

If you made it this far, congratulations, you're a Google App Engine expert! Just kidding. This article really only scratches the surface when it comes to Google App Engine, but it does serve as a quick start guide that you can use to get something up and running in a weekend. Be sure to visit the Google App Engine developer's guide to find much more in-depth tutorials, sample code, and videos that explain all the features of the platform. If you run into problems, remember that you can always search or ask the experts on Stack Overflow for some help. Good luck!

Tuesday, April 30, 2013

SICP 2.66: Sets and information retrieval

From SICP section 2.3.3 Sets and information retrieval

The final part of section 2.3.3 asks us to consider a database that contains records, each of which has a key and some data. If the database is represented as an unordered list, a record can be looked up by its key using the following procedure.

(define (lookup given-key set-of-records)
  (cond ((null? set-of-records) false)
        ((equal? given-key (key (car set-of-records)))
         (car set-of-records))
        (else (lookup given-key (cdr set-of-records)))))

We can define simple procedures for making a record out of a key and its data, and for extracting the key and data from an existing record in order to test the procedure above.

(define (key record) (car record))
(define (data record) (cdr record))
(define (make-record key data) (cons key data))

(define database
  (list (make-record 1 'Bill)
        (make-record 2 'Joe)
        (make-record 3 'Frank)
        (make-record 4 'John)))
  
> (lookup 3 database)
'(3 . Frank)
> (data (lookup 1 database))
'Bill

Exercise 2.66 asks us to implement the lookup procedure for the case where the set of records is structured as a binary tree, ordered by the numerical values of the keys.

We can start by including the list->tree and partial-tree procedures given for exercise 2.64, along with a few required supporting procedures.

(define (entry tree) (car tree))
(define (left-branch tree) (cadr tree))
(define (right-branch tree) (caddr tree))
(define (make-tree entry left right)
  (list entry left right))
  
(define (list->tree elements)
  (car (partial-tree elements (length elements))))

(define (partial-tree elts n)
  (if (= n 0)
      (cons '() elts)
      (let ((left-size (quotient (- n 1) 2)))
        (let ((left-result (partial-tree elts left-size)))
          (let ((left-tree (car left-result))
                (non-left-elts (cdr left-result))
                (right-size (- n (+ left-size 1))))
            (let ((this-entry (car non-left-elts))
                  (right-result (partial-tree (cdr non-left-elts)
                                              right-size)))
              (let ((right-tree (car right-result))
                    (remaining-elts (cdr right-result)))
                (cons (make-tree this-entry left-tree right-tree)
                      remaining-elts))))))))

This makes it easier to convert the existing database to one structured as a binary tree.

> (define tree-db (list->tree database))
> tree-db
'((2 . Joe) ((1 . Bill) () ()) ((3 . Frank) () ((4 . John) () ())))

Finally, we can write the new implementation of lookup using element-of-set? as a guide.

(define (lookup given-key set-of-records)
  (cond ((null? set-of-records) #f)
        ((= given-key (key (entry set-of-records)))
         (entry set-of-records))
        ((< given-key (key (entry set-of-records)))
         (lookup given-key (left-branch set-of-records)))
        ((> given-key (key (entry set-of-records)))
         (lookup given-key (right-branch set-of-records)))))

> (lookup 3 tree-db)
'(3 . Frank)
> (lookup 1 tree-db)
'(1 . Bill)
> (lookup 5 tree-db)
#f
> (data (lookup 2 tree-db))
'Joe
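
For comparison, the same tree lookup can be sketched in Python. This is only an illustrative translation, not part of the exercise: a tree is represented as None or an (entry, left, right) tuple, each record is a (key, data) tuple, and list_to_tree is a simplified stand-in for list->tree that assumes the records are already sorted by key.

```python
# A tree is either None (empty) or a tuple: (entry, left, right).
# Each entry is a (key, data) record, like the Scheme version's pairs.

def list_to_tree(records):
    """Build a balanced tree from records already sorted by key."""
    if not records:
        return None
    mid = (len(records) - 1) // 2  # same split as partial-tree's left-size
    return (records[mid],
            list_to_tree(records[:mid]),
            list_to_tree(records[mid + 1:]))


def lookup(given_key, tree):
    """Return the record with the given key, or False if it is absent."""
    while tree is not None:
        entry, left, right = tree
        if given_key == entry[0]:
            return entry
        # Smaller keys live in the left branch, larger keys in the right.
        tree = left if given_key < entry[0] else right
    return False


database = [(1, "Bill"), (2, "Joe"), (3, "Frank"), (4, "John")]
tree_db = list_to_tree(database)
print(lookup(3, tree_db))
print(lookup(5, tree_db))
```

The loop replaces the Scheme version's tail recursion; each step discards one branch, so a lookup in a balanced tree takes a number of steps proportional to the logarithm of the number of records.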

For links to all of the SICP lecture notes and exercises that I've done so far, see The SICP Challenge.