Monday, August 22, 2016

Twitter's Favorite Films

If you were on Twitter at all last week, you probably noticed a flurry of "Fav7" hashtags trending, including #Fav7Films, #Fav7Books, and #Fav7TVShows, where people posted lists of their seven favorite things in each category.

I thought it would be fun to scrape the data to see what Twitter's favorite films are, and compare them to the top-rated films on IMDb and Rotten Tomatoes. Here are the results.

Twitter's Top 25 Films
  1. The Dark Knight (9.0 IMDb, 94% RT)
  2. Pulp Fiction (8.9 IMDb, 94% RT)
  3. The Empire Strikes Back (8.8 IMDb, 94% RT)
  4. Goodfellas (8.7 IMDb, 96% RT)
  5. The Shawshank Redemption (9.3 IMDb, 91% RT)
  6. Fight Club (8.8 IMDb, 79% RT)
  7. The Godfather (9.2 IMDb, 99% RT)
  8. Back to the Future (8.5 IMDb, 96% RT)
  9. Inception (8.8 IMDb, 86% RT)
  10. Jurassic Park (8.1 IMDb, 93% RT)
  11. Forrest Gump (8.8 IMDb, 72% RT)
  12. The Big Lebowski (8.2 IMDb, 81% RT)
  13. Jaws (8.0 IMDb, 97% RT)
  14. Star Wars (8.7 IMDb, 93% RT)
  15. Raiders of the Lost Ark (8.5 IMDb, 94% RT)
  16. The Princess Bride (8.1 IMDb, 97% RT)
  17. Blade Runner (8.2 IMDb, 89% RT)
  18. Alien (8.5 IMDb, 97% RT)
  19. The Departed (8.5 IMDb, 91% RT)
  20. The Matrix (8.7 IMDb, 87% RT)
  21. Interstellar (8.6 IMDb, 71% RT)
  22. Aliens (8.4 IMDb, 98% RT)
  23. Good Will Hunting (8.3 IMDb, 97% RT)
  24. The Shining (8.4 IMDb, 88% RT)
  25. Die Hard (8.2 IMDb, 92% RT)

A few observations:

  • Fewer than half of the films in Twitter's top 25 are also in IMDb's top 25.
  • The Godfather (1972) is the oldest film on the list, while Interstellar (2014) is the newest.
  • Harrison Ford has starred in the most (4) of the top 25 films, while Steven Spielberg has directed the most (3).
  • Only three sequels appear in the top 25. For two of those, the original film also appears in the top 25.
  • Action/adventure and science fiction films dominate the list.
  • As popular as comic book films are right now, only one film based on a comic book character is in the top 25 (although it did take the top spot).

The source code

Start by loading the required libraries, including twitteR for accessing the Twitter API, and setting up authentication. You'll need to sign up for free on Twitter Developers to get your own authentication keys and tokens. (If you've never done this before, see Bogdan Rau's Collecting Tweets Using R and the Twitter Search API for a more detailed guide.)


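library(twitteR) # Twitter API client
library(dplyr)   # tbl_df()
library(purrr)   # map_df()

# Download cacert file for Windows use.
download.file(url="", destfile="cacert.pem")

consumer_key <- 'your key'
consumer_secret <- 'your secret'
access_token <- 'your access token'
access_secret <- 'your access secret'

# authenticate with the Twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)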

Next, query the Twitter search API for the "#Fav7Films" hashtag, and initialize a data frame with tweets.

requests <- 1 # keep count of how many requests are sent
num_tweets <- 3000 # number of tweets to fetch per request
delay <- 62.0 # add in a delay so the API doesn't block

fav_film_tweets <- searchTwitter("#Fav7Films\n", n=num_tweets)
Sys.sleep(delay) # be nice to the API
fav_film_df <- tbl_df(map_df(fav_film_tweets, as.data.frame))

fav_film_all <- fav_film_df[fav_film_df$isRetweet == FALSE, ]

Now we want to keep searching in a loop, until we've downloaded all the tweets we're interested in. To do that, we'll keep looping as long as the API returns as many tweets as we told it to. Once it returns fewer tweets, we know it ran out.

while(nrow(fav_film_df) == num_tweets) {
    max_id <- fav_film_df$id[num_tweets] # ID of the oldest tweet in the last batch
    requests <- requests + 1
    fav_film_tweets <- searchTwitter("#Fav7Films\n", n=num_tweets, maxID=max_id)
    fav_film_df <- tbl_df(map_df(fav_film_tweets, as.data.frame))
    fav_film_all <- rbind(fav_film_all, fav_film_df[fav_film_df$isRetweet == FALSE, ])

    Sys.sleep(delay) # be nice to the API
}

Note that I added the maxID=max_id parameter to the request. This tells the search API to return tweets older than the previous set of tweets. Also note that I added a delay in the loop. Twitter has set a rate limit on their search API to 15 requests every 15 minutes, so this delay is to avoid being blocked.
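
To make the paging logic concrete, here is a small offline sketch of the same loop. It doesn't call Twitter at all: fake_search() is a hypothetical stand-in for searchTwitter(), and the 25-tweet archive and page size of 10 are made up for illustration. It also assumes the max ID bound is inclusive, so the tweet at each page boundary comes back twice.

```r
# Hypothetical stand-in for searchTwitter(): returns up to n tweet IDs that
# are <= max_id, newest first, from a pretend archive with IDs 1 to 25.
fake_search <- function(n, max_id = Inf) {
  archive <- 25:1
  head(archive[archive <= max_id], n)
}

page_size <- 10
page <- fake_search(page_size)
all_ids <- page

# Keep paging for as long as the last request came back full.
while (length(page) == page_size) {
  max_id <- page[page_size]   # oldest ID in the page we just fetched
  page <- fake_search(page_size, max_id = max_id)
  all_ids <- c(all_ids, page)
}

# The bound is inclusive in this sketch, so boundary IDs appear twice.
unique_ids <- unique(all_ids)
```

In the real loop those boundary duplicates don't matter much, since the later step that keeps one tweet per user removes them anyway.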

That will take a while, but once it's done we'll have over 100,000 tweets, so we want to save them so we don't have to go through all that again. I just saved the whole data frame to an R data blob.

save(fav_film_all, file="Fav7FilmTweets.Rda")

You can download that file from GitHub at Fav7FilmTweets.Rda if you want to follow along from this point, or if you want to do your own analysis on this data set. Just use load("Fav7FilmTweets.Rda") to load the data frame from the file.

Next, we want to remove any retweets or multiple tweets from the same user.

fav_film_all <- fav_film_all[fav_film_all$isRetweet == FALSE, ]
fav_film_all <- fav_film_all[!duplicated(fav_film_all$screenName), ]
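
As a quick illustration of what those two filters do, here is a sketch on a tiny made-up data frame. The users and tweets are invented, and only the column names (screenName, text, isRetweet) are taken from the real data.

```r
# Toy data: alice tweets twice, and carol's only tweet is a retweet.
tweets <- data.frame(
  screenName = c("alice", "bob", "alice", "carol"),
  text       = c("jaws", "alien", "aliens", "RT jaws"),
  isRetweet  = c(FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Drop retweets first, then keep only the first row seen for each user.
originals <- tweets[tweets$isRetweet == FALSE, ]
one_per_user <- originals[!duplicated(originals$screenName), ]
```

duplicated() flags every occurrence after the first, so the tweet that survives for each user is whichever one appears earliest in the data frame.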

Now we can start parsing the lists of film titles from the tweets. Most people formatted their titles on separate lines, so we'll assume that format. Any tweets that don't use that format will just fall to the bottom of the list of films once we rank them.

# remove the hashtag, ignoring case
fav_film_all$text <- gsub("#fav7films", "", fav_film_all$text, ignore.case=TRUE)

# remove numbers from lists
fav_film_all$text <- gsub("\\d\\.|\\)|-", "", fav_film_all$text)

# convert to common case for all tweets
fav_film_all$text <- tolower(fav_film_all$text)

# trim any whitespace left over from earlier steps
fav_film_all$text <- trimws(fav_film_all$text)
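
Here is a sketch of those four steps applied to a single made-up tweet, so you can see what each one leaves behind. The sample text is invented; the patterns are the same ones used above.

```r
# A made-up tweet in the common one-title-per-line format.
tweet <- "#Fav7Films\n1. The Godfather\n2. Jaws\n3. Alien"

step1 <- gsub("#fav7films", "", tweet, ignore.case = TRUE)  # drop the hashtag
step2 <- gsub("\\d\\.|\\)|-", "", step1)                    # drop "1.", ")" and "-"
step3 <- tolower(step2)                                     # common case
cleaned <- trimws(step3)                                    # trim both ends

print(cleaned)
```

Note that trimws() only trims the ends of the whole tweet, so in a numbered list like this one the space that followed each "1." is still there at the start of the later lines.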

At this point, we should have a bunch of lists of seven movie titles. What we want to do next is separate them all out into one large list of titles, count how many times each title appears, then sort the list. We'll also remove "A" and "The" from the beginning of any titles that include them, since many people included them, but many didn't.

titles <- list()
titles <- append(titles, strsplit(fav_film_all$text, split="\n"))
titles <- unlist(titles)

# remove leading 'a' and 'the' from titles
titles <- gsub("^a ", "", titles)
titles <- gsub("^the ", "", titles)

# remove empty titles
titles <- titles[titles != ""]

ranked_titles <- sort(table(titles), decreasing=TRUE)
top_25 <- head(ranked_titles, 25)

That's the final list. There are a lot of other conditioning steps that we could have taken, like looking for common abbreviations or misspellings, but I think this gets us pretty close to an accurate list.
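
To see the split, strip, and count steps in miniature, here is a sketch that runs them on three invented, already-cleaned tweets. The titles are only sample data, not taken from the real data set.

```r
# Three made-up cleaned tweets, one title per line.
cleaned <- c("the godfather\njaws\nalien",
             "godfather\nalien",
             "a few good men\nalien")

# One long vector of titles across all tweets.
titles <- unlist(strsplit(cleaned, split = "\n"))

# Strip a leading "a " or "the " so both spellings count as the same film.
titles <- gsub("^a ", "", titles)
titles <- gsub("^the ", "", titles)
titles <- titles[titles != ""]

# Count each title and sort with the most common first.
ranked <- sort(table(titles), decreasing = TRUE)
print(head(ranked, 2))
```

Because the pattern "^a " requires a space after the "a", a title like "alien" is left alone, while "the godfather" and "godfather" are counted together.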

You can view the full R source code that I used to gather and analyze tweets for this project in my Fav7 GitHub repository. Feel free to fork that and use it to analyze other Twitter favorites, and leave me a comment if you do, or if you have any questions.

Monday, August 15, 2016

How to Ask a Question on Stack Overflow, a Minimal Guide

"How do I ask a question on Stack Overflow without having it immediately downvoted and closed?" This question is frequently asked by new users on Meta Stack Overflow, and people still seem to have issues with the official guidelines, so maybe a shorter, to-the-point guide will help.

Assuming your question is on-topic, there are three things you need to include when posting a question on Stack Overflow:

1. What your program is supposed to do
2. Your code
3. What your program is actually doing

If you leave any of those three things out, people will have trouble answering your question, so they'll often downvote it and vote to close it instead of trying to answer it. (OK, I admit it. It's me. I'm the one downvoting you.) Let's take a closer look at each of these things, and maybe I can convince you why each one is important.

What your program is supposed to do

You need to include an explanation of what you are trying to do in your question. Without it, people are not going to be able to start helping you. Don't just assume that people will understand from the title or context what you're trying to do. Spell it out. If you help people understand your problem, they're much more able to help you with a solution.

Your code

Does every question on Stack Overflow require code? No. But the majority of them do. Most often, people are going to need to be able to run your code in order to help you. If you have code, you need to post it. If you're going to write code and you don't know where to start, you need to start before asking about it on Stack Overflow. Don't post a link to your code. Don't post an image of your code. Post your code.

What your program is actually doing

Describe how your program's behavior differs from what you expect (that is, from what your program is supposed to do). Not including this information leaves people guessing about what the problem actually is. Does your code not even compile? Give the wrong output? Print an error message? Lock up your machine and set fire to the building? Each one of these very different behaviors will point an experienced developer to the solution much more quickly than just reading your code. Include any output, including the text of any error messages, in your question. Those cryptic error messages might not mean anything to you, but a more experienced programmer can lead you directly to the problem if they know all of the details.

Is that really all I need?

No, not really, but this covers the majority of the questions that I see closed and downvoted on Stack Overflow. If you follow these guidelines, you should have far fewer problems with getting your questions answered. Once you've mastered just asking a question and not getting it closed, you should read some more complete guides on asking great questions, like Jon Skeet's Writing the Perfect Question.