Social Media Data Mining with Raspberry Pi: 9 Videos for the Complete Beginner

Since the start of this year, I’ve been working on a project to take a $30 Raspberry Pi 2 computer and turn it into a social media data mining machine using the programming language Python. The words “programming language” may be off-putting, but my goal is to work through the process step-by-step so that even a complete beginner can follow along and accomplish the feat.

The inexpensive, adaptable $30 Raspberry Pi 2

I’m motivated by two impulses. My first impulse is to help people gain control over and ownership of the information regarding interaction that surrounds us. My second impulse is to demonstrate that mastery of social media information is not limited to the corporate, the government, or the otherwise well-funded sphere. This is not a video series for those who already are technologically wealthy and adept. It’s for anyone who has $30 to spare, a willingness to tinker, but the feeling that they’ve been left out of the social media data race. I hope to make the point that anyone can use social media data mining to find out who’s talking to whom. The powers that be are already watching down at us: my hope is that we little folks can start to watch up.

I’m starting the project by shooting videos. The video series has further potential, but has proceeded far enough along to represent a fairly good arc of skill development. Eventually I’d like to transcribe the videos and create a written and illustrated how-to pamphlet; these videos are just the start.

Throughout the videos, I’ve tried not to cover up the temporary mistakes, detours and puzzling bugs that are typical of programming. No one I know of hooks up the perfect computer system or writes a perfect program on the first try. Reading error messages and sleuthing out their causes is part of the process, and you’ll see that occasionally in these videos.

Please feel free to share the videos if you find them useful. I’d also appreciate any feedback you might have to offer.

Video 1: Hardware Setup for the Raspberry Pi

Video 2: Setting up the Raspberry Pi’s Raspbian Operating System

Video 3: Using the Raspberry Pi’s Text and Graphical Operating Systems

Video 4: Installing R

Video 5: Twitter, Tweepy and Python

Video 6: Debugging

Video 7: Saving Twitter Posts in a CSV File

Video 8: Extracting and Saving Data on Twitter URLs, Hashtags, and Mentions

Video 9: Custom Input
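
For readers who would like a taste of the kind of task covered in Videos 7 and 8 before watching, here is a minimal sketch in Python. It is not the code from the videos: the sample posts, the regular expressions, and the function names are all my own assumptions for illustration, and it uses only Python's standard library rather than Tweepy. It pulls hashtags, mentions, and URLs out of post text and writes them as CSV rows.

```python
import csv
import io
import re

# Hypothetical sample posts; a real run would use text returned by the Twitter API.
SAMPLE_POSTS = [
    "Testing my #RaspberryPi with @example_user https://example.com/pi",
    "No entities in this one",
]

# Simple patterns chosen for illustration; real posts can contain edge cases
# (punctuation, non-Latin scripts) that these do not handle.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text):
    """Return one CSV-ready row with space-separated hashtags, mentions, and URLs."""
    return {
        "text": text,
        "hashtags": " ".join(HASHTAG.findall(text)),
        "mentions": " ".join(MENTION.findall(text)),
        "urls": " ".join(URL.findall(text)),
    }


def write_csv(posts, handle):
    """Write a header plus one row per post to any file-like object."""
    writer = csv.DictWriter(
        handle, fieldnames=["text", "hashtags", "mentions", "urls"]
    )
    writer.writeheader()
    for post in posts:
        writer.writerow(extract_entities(post))


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for open("posts.csv", "w", newline="")
    write_csv(SAMPLE_POSTS, buffer)
    print(buffer.getvalue())
```

To save to disk instead of printing, pass a file opened with `open("posts.csv", "w", newline="")` in place of the in-memory buffer.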

2015 American Community Survey: Immigrants Less Likely to be Housed in Adult Corrections Facilities

Every September, the U.S. Census Bureau releases data regarding the U.S. population from its annual American Community Survey. The American Factfinder website very handily archives this data and makes it available through guided or customized search.

I particularly encourage you to visit American Factfinder and search for a table titled “CHARACTERISTICS OF THE GROUP QUARTERS POPULATION BY GROUP QUARTERS TYPE.” That table sounds dry and uninteresting, but it contains a nugget of gold for any voter who wants to fact-check claims being made lately about immigrants.  In press releases and in speeches this year, political officeholders and candidates have asserted that immigrants to the United States are dangerous and liable to commit crimes.  Of course, it is possible to find tragic stories of crimes committed by immigrants to the United States, just as it is possible to find tragic stories of crimes committed by people born in the United States.  But individual stories are not a good basis for policy. Claims about immigrants as a source of crime are strong in their accusation and as such need to be evaluated on the basis of systematic evidence.

To cut to the chase, data from this table reveal that immigrants make up a lower share of people held in adult corrections facilities in the United States than their share of the U.S. population.  “Native born” Americans (those born in the United States) made up 86.5% of the U.S. population in 2015, but made up 91.9% of those housed in adult correctional facilities in the United States in 2015.  The “foreign born” immigrants to the United States made up 13.5% of the U.S. population in 2015, but made up only 8.1% of those housed in adult correctional facilities in the U.S. in 2015:

2015 American Community Survey Table: U.S. Immigrants are Less Likely to be in Adult Corrections Facilities than those born in the U.S.A.
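
One way to read these four percentages is as a representation ratio: a group's share of the adult corrections population divided by its share of the overall population. The short Python sketch below uses only the figures quoted above; the variable and function names are my own, chosen for illustration. A ratio above 1 means a group is over-represented in corrections facilities relative to its population share, and a ratio below 1 means it is under-represented.

```python
# Shares quoted in the text (2015 American Community Survey), in percent.
population_share = {"native born": 86.5, "foreign born": 13.5}
corrections_share = {"native born": 91.9, "foreign born": 8.1}


def representation_ratio(group):
    """Share of the corrections population divided by share of total population."""
    return corrections_share[group] / population_share[group]


for group in population_share:
    print(f"{group}: {representation_ratio(group):.2f}")
# native born: 1.06
# foreign born: 0.60
```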

This data does not appear to be consistent with the claim that foreigners coming to the United States to live are a unique and concentrated source of crime.  Trends for 2015 match findings for previous years compiled for the National Bureau of Economic Research.  Those who wish to pursue policies against immigrants on the basis that doing so would cut crime rates in the United States need to explain how their assertions match these observations.

Social Media Accounts of Candidates for the Maine State Senate

Deciding who to vote for in state legislative campaigns can sometimes be tricky because thorough coverage of local candidates can be hard to find. State legislators in Maine are known for their accessibility. This may be because Maine’s legislative districts tend to be small; it may also be due to the friendly nature of Maine folk in general. Whatever the reason, getting in touch with candidates for Maine political office is both important and possible.

In this day and age, the quickest way to learn about state legislative candidates and to find their contact information is through social media platforms like individual web pages, Facebook and Twitter.  To help you in that process, I’ve put together a spreadsheet with information about the social media presence of the 70 candidates for the Maine Senate in 2016, along with some additional contextual information. To download this information for personal use, click here for a Microsoft Excel file.

This sort of information changes all the time — if you have updated information about new accounts, please share a comment below to let me know, or write to james.m.cook@maine.edu.

Two Fact Checks on Donald Trump and Crime

In tonight’s speech, Donald Trump will accept the presidential nomination of the Republican party. The text of Trump’s speech makes the following claims regarding crime in the United States:

“These are the facts:

“Decades of progress made in bringing down crime are now being reversed by this Administration’s rollback of criminal enforcement.

“Homicides last year increased by 17% in America’s fifty largest cities. That’s the largest increase in 25 years.”

Let’s look at these two claims and check the facts.

Fact Check of Claim 1: “Decades of progress made in bringing down crime are now being reversed by this Administration’s rollback of criminal enforcement.”

Response: The annual FBI report Crime in the United States provides the most recent data on crime, both in the United States overall and in particular communities. Annual reports are released every fall to describe crime in the year before, based on direct reports of police officers all over the country (the delay occurs because it takes time to gather all those reports and carefully tabulate them). The most recent report was released in 2015, describing crime in the year 2014. Anyone who tells you they know about U.S. national crime trends for any more recent year is fibbing — because 2016 isn’t over yet, and because final counts for 2015 are still being worked on.

The trends on violent and property crime victimization rates in the United States are shown below, from the very first page of the 2015 Crime in the United States report, released at the end of September 2015:

violent and property victimization rates in the United States from 1993 to 2014

People can disagree about policy, but it is not possible for policy changes to have led to a reversal in progress in the crime rate in the United States, because there is no evidence that such a reversal exists.

Fact Check of Claim 2: “Homicides last year increased by 17% in America’s fifty largest cities. That’s the largest increase in 25 years.”

We don’t actually know whether this is the case, because final data for 2015 is not yet available. A preliminary count has been released, but it is not a final count, it covers only the first six months of 2015, and it covers only cities with over 100,000 in population. Here it is. Let’s realize, based on this data (look at Table 4), that:

First, we do not yet actually have a final count for 2015.

Second, on the basis that only the first six months of 2015 have been counted, it is not possible to make the conclusion that Donald Trump makes regarding the entire year.

Third, if we actually look at the fifty largest cities in the United States, and look at the preliminary count for the first six months of 2015 (not the entire year), we find that homicides are up 8.4% in America’s fifty largest cities compared to 2014. Donald Trump’s speech claims twice as much as this.

Fourth, it’s interesting that the speech only focuses on homicide, and not on violent crime in general. The increase in the violent crime rate from 2014 to 2015 is 3.1%.

Fifth, even these rises do not take into account population growth in America’s fifty largest cities, which will of course increase the number of murders.

Sixth, even this increase, in the context of the huge falls of the last twenty years, still marks a low crime rate in America’s fifty largest cities in recent history. The preliminary homicide rate in the fifty largest cities of the United States in the first six months of 2015 was 4.06 homicides per 100,000 people. In the first six months of 2015, the overall violent crime rate was 305.7 per 100,000 people. By comparison, in 2008, when Barack Obama was elected president and the decline in homicides was already well underway, the homicide rate was 12.1 per 100,000 people in America’s 50 largest cities, and the overall violent crime rate was 852.9 per 100,000 people. In other words, since Barack Obama became president, if the 2015 preliminary data holds, the homicide rate is down 66.4% and the violent crime rate is down 64.2%.
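
For anyone who wants to check the two percentages in the sixth point, the arithmetic is a simple percent decline, taking the rates exactly as stated above. The Python sketch below is illustrative only; the function name is my own.

```python
def percent_decline(earlier, later):
    """Percentage drop from an earlier rate to a later rate."""
    return (earlier - later) / earlier * 100


# Rates per 100,000 people, as quoted in the text.
homicide_decline = percent_decline(12.1, 4.06)
violent_crime_decline = percent_decline(852.9, 305.7)

print(f"Homicide rate: down {homicide_decline:.1f}%")            # down 66.4%
print(f"Violent crime rate: down {violent_crime_decline:.1f}%")  # down 64.2%
```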

It turns out that Donald Trump’s claim is based on a very early estimate posted by a blogger using very early data in January 2016, less than a month after 2015 ended.

This second claim by Republican nominee Donald Trump, like the first, is not supported by the facts.

Learning Unbounded: EdX Introduction to R

It’s an open secret: to be a university professor is to be a perpetual student.  Learning doesn’t stop with the PhD; there’s always something new to read, always something new to discover, always something new to write, always something new to analyze, always a new technique to understand. This is why academics love the summer: finally, after teaching what we’ve already learned, we can learn some more!

One of my projects this summer is to bone up on the basics of a computer program for data analysis and visualization called R.  When I was a graduate student in the 1990s, statistical software was produced exclusively by companies at a fairly steep price.  Even now SAS 9.4, a software package used for data analysis in the academic and business communities, costs many thousands of dollars for an individual license (it’s so expensive that SAS won’t publish its price publicly).  If you were lucky, you had access to a university lab with software already installed.  If you didn’t have access and you wanted to run an analysis beyond the simplest level, you were simply out of luck.

All that changed with the introduction of R, a free and open-source program that runs on Windows computers, Mac computers, Unix computers and even web servers.  Methodologists from all kinds of disciplines are increasingly devoted to the development and extension of R, meaning that the latest analytical techniques are regularly added to R through easily added plug-ins called “packages.” R is easy to download, quick to install, and …

… well, I’d like to say it’s easy to run, but the truth is that for a generation that has grown up pointing and clicking, it may be a bit intimidating to see a program that requires you to work almost entirely by entering text commands at a prompt or developing programs of saved commands:

Screenshot of R running in the Windows environment

Still, with a bit of practice, it’s not much harder to type in text commands than it is to choose options in a drop-down menu.  The difference is that with drop-down menus, all options are presented to you in an organized fashion.  When you use R, you have to start out knowing what the commands are, and if you don’t know, you have to go find out.  It’s not R’s responsibility to show you what to do; it’s your responsibility to learn what R can do.  This is learning unbounded.

I became familiar with R by necessity earlier this year, when I needed to generate robust variance estimates in order to account for clustering in a sample.  That option isn’t available in most free menu-driven statistical programs, and I had a budget of $0 for my research project, so I installed R and the package rms by Frank E. Harrell, Jr.  R got the job done.

Since then, I’ve become aware that R can do much more than run a statistical analysis.  It can be used to gather data automatically.  It can be used to write automated webpages.  It can be used to create simulations.  It can visualize patterns in data with amazing graphics and videos (browse through the Google+ community for Statistics and R to get a taste of the possibilities).  But this level of high-end performance requires a more fundamental understanding of R than I’ve got right now.  To get back to basics and build myself a good foundation of understanding, I’ve started EdX’s Introduction to R Programming course.  This is another example of learning unbounded.  It’s an entirely online educational experience, I haven’t paid a cent to enroll, and I’m finding myself interacting with people from all over the globe in the course’s discussion sections.  Students in this course are asked to introduce themselves and say a little bit about where they’re from.  On a whim this morning, I tallied up the countries represented among students in the R course.  They are:

The United States isn’t even the top spot for R students; that position is taken by India, and there are 48 nations sending at least one student to the course. Just as the way we produce knowledge is changing, so is the way we learn how to produce knowledge.

P.S. Faced with a generation of academic and business analysts flocking to R, SAS has lost significant market share. Earlier this year, SAS responded by making a partial version of its software available for free. This software is called SAS University Edition and can be downloaded here. I’ve found installation to be more complicated and time-consuming than for R (the whopping download of a 1.8 GB installation file and the need to first install Oracle VM VirtualBox management software account for most of this difficulty), but I’m hopeful that I’ll have this second package of analytical software up and running soon so that I can compare the ease and power of the two programs.

Opening Maine Campaign Contribution Data Gets Tricky

Over the past year, I’ve been developing an Open Maine Politics website to mix, share and make social a variety of kinds of information about the Maine State Legislature.  Campaign finance profiles for legislators are part of the developing picture, but this weekend I’m hitting a speed bump as inconsistencies in the Maine Ethics Commission’s official dataset force me to look more closely at each case and fix errors one by one.  Cleaning the data feels like spring cleaning.  At least the season’s right.

A Map of Popular Connotations for 12 Social Media Sites, Winter 2014

If I say “Facebook is…,” how would you complete the sentence?

The response of any individual person to that question may be idiosyncratic, but when we look at the aggregate patterns that build up across the responses of many people, trends emerge that reflect our cultural beliefs and values regarding social media.  One convenient way to track trends is through Google Autocomplete.  When you enter a term in the Google search bar, have you ever noticed that certain suggestions appear to complete your thought automatically?

Google Autocomplete suggestions in November of 2014 for Facebook Is...

These are not random suggestions.  Rather, they reflect a weighted combination of how often different phrases appear in other Google “users’ searches and content on the web.”  Speaking in sociological terms, they are an indication of the most salient cultural associations with the phrase you’ve started typing.

In the autocompletion of “Facebook is…” that you see above, results are presented as a simple list of items, but it’s possible to obtain richer information than this. First, I’ve nabbed Google’s autocompletion lists for 12 of the most popular English-language social media platforms: Facebook, Twitter, Tumblr, LinkedIn, Vine, Flickr, MySpace, Ello, Instagram, Pinterest, Google+, and YouTube. To each platform’s name I’ve added the prompting word “is” and found up to 10 most-popular search suggestions (Some new platforms like Ello have low enough search volume to generate few results. Some other platforms have repetitive results I’ve combined — “Flickr is slow” and “Flickr is too slow” are just counted as “Flickr is slow.”). An interesting feature of these lists is commonality. Despite the rich variety and nearly endless possibility of the English language, many words to complete the phrase “_______ is…” appear on Google’s top 10 list for more than one social media platform. For instance, the phrase “______ is slow” is among the top 10 results for Facebook, Tumblr, Flickr, Pinterest and YouTube. The phrase “_______ is dead” is among the top 10 results for a full 9 out of the 12 social media platforms studied here.

To graph commonalities, I’ve created the 2-mode semantic network graph you see below. A 2-mode (or “bimodal”) graph is one in which there are two kinds of nodes indicating two different kinds of objects. In this graph, social media platforms are the first kind of node, and they are indicated in yellow. The second kind of node is a top-10 ending of the phrase “________ is” by Google autocomplete. These are color-coded pink if the phrase completions indicate negative sentiment, green if the phrase completions indicate positive sentiment, and white if there is no clear sentiment expressed with the phrase completion. For some ambiguous phrases such as “YouTube is on fire” and “Pinterest is ruining my life,” a quick browse through Google search results helps to make sentiment more clear (both of these phrases turn out to be complimentary). Finally, a line is drawn from a social media platform to a phrase if that phrase is listed in the top 10 Google autocomplete results for that social media platform.

Social Media Is... Most Common Associations of Popular Social Media Sites as Identified through Google Autocomplete
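
To make the structure of a 2-mode graph concrete, here is a minimal Python sketch. It is not the tool I used to draw the graph, and the data are deliberately partial: they include only completions named in this post, not the full set of autocomplete results. The merge table and function names are assumptions for illustration. Each edge links a platform node to a phrase node, and repetitive completions are combined before edges are built.

```python
from collections import defaultdict

# Partial, illustrative data: only completions named in this post,
# not the full autocomplete lists.
raw_suggestions = {
    "Facebook": ["slow"],
    "Tumblr": ["slow"],
    "Flickr": ["slow", "too slow"],
    "Pinterest": ["slow", "ruining my life"],
    "YouTube": ["slow", "on fire"],
}

# Hypothetical merge table for repetitive completions.
MERGE = {"too slow": "slow"}


def build_edges(suggestions):
    """Return the set of (platform, phrase) edges of the 2-mode graph."""
    edges = set()
    for platform, phrases in suggestions.items():
        for phrase in phrases:
            edges.add((platform, MERGE.get(phrase, phrase)))
    return edges


def platforms_by_phrase(edges):
    """Group platform nodes by the phrase node they connect to."""
    index = defaultdict(set)
    for platform, phrase in edges:
        index[phrase].add(platform)
    return index


edges = build_edges(raw_suggestions)

# Phrases connected to more than one platform are the commonalities.
shared = {
    phrase: sorted(platforms)
    for phrase, platforms in platforms_by_phrase(edges).items()
    if len(platforms) > 1
}
print(shared)
```

Storing the graph as a set of edges means that “Flickr is slow” and “Flickr is too slow” collapse into a single link once they are merged.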

For the 12 social media platforms, there are 68 distinct phrase completions listed in the Google autocomplete top 10. A large majority of these phrase completions communicate clear sentiment, and a large majority of those sentiments are criticisms. Mentions of slow speed, crashes and unavailability appear common. With the exception of YouTube and Pinterest, all of the 12 social media platforms are popularly depicted as “dead” or “dying.” Predictions of doom for social media platforms appear to be a cultural universal, at least among the socially-distinct set of participants in social media and web searches. Facebook, LinkedIn, Vine, Flickr, Ello and Instagram have no positive phrases listed in their autocompletions. A strikingly positive deviation from the negative trend appears for MySpace. This finding is unintuitive, considering how far interest in MySpace has fallen since 2008. Consider the trend in Google search volume for “MySpace” from 2004-2014:

Relative Search Volume for MySpace in Google, via Google Trends, 2004 to 2014

The letters on that graph indicate influential mainstream news articles mentioning MySpace; does the lack of any articles whatsoever since 2010 hint at an explanation? Without newspaper or magazine articles promoting the MySpace network, and with hardly anyone searching for MySpace anymore, who is left but a small group of true believers in the once-great social network? The strongly positive sentiment toward MySpace in its top-10 rankings may be due to positivity in the small set of people who are still paying attention.

What other patterns do you notice in this graph of popular search completions for social media platforms? Do the autocompletions distinguish between different social media platforms, or do they unify?

Gas Prices in and out of Context: Hi and Lois need a Fact Check

On October 18, 2014, the comic strip Hi and Lois looked back with fondness on a time when gas prices were just 35.9 cents a gallon.  In the present day, the middle-class character Hi grimaces as he pumps gas costing $3.99 a gallon.  In a meta-analysis of existing research, social scientist Michael R. Hagerty found that people tend to view their own lives as getting better but at the same time tend to look backward in time and conclude that the lot of the average person is getting worse.  In other words, we use rose-colored glasses to view our own lives, but gray-tinted glasses to view trends in the world in general.

Hi’s view of the world is certainly tinted gray in the strip you see below, but is this pessimistic funk merited?  I don’t think so; the way out of the trap of our psychological biases is to check for sociological context.  Doing that, I’d alter the Hi and Lois strip from the original into a more realistic new version:

Put Hi and Lois in Context -- are gas prices in 2014 really that bad?

Correction 1: Gas hasn’t had a price of $3.99 per gallon in the United States since July of 2008. The average price per gallon of gas in the United States was down to about $3.10 in the middle of October 2014, and prices are getting even better a month later. Source: St. Louis Federal Reserve Bank Economic Research Database.

Correction 2: The last time gas cost 35.9 cents a gallon in the United States was the year 1969, but that literal price doesn’t tell the whole story; those 35.9 cents were worth a whole lot more in 1969 than they are worth today. If we adjust for inflation, paying 35.9 cents in 1969 had the same punch to our wallets as paying $2.32 today. Sources: Bureau of Labor Statistics and InflationData.com.

Correction 3: Why do we put gasoline in cars? To go somewhere. Chance Browne forgets that the fuel efficiency of cars was far different in 1969 from the fuel efficiency we experience nowadays. In 1969, passenger cars traveled 13.6 miles on a gallon of gas, on average. In 2013, the last full year for which data is available, passenger cars traveled 36.0 miles on a gallon of gas, on average. Sources: U.S. Department of Transportation and Federal Highway Administration.

If we put all these pieces of information together, it turns out that on average and adjusting for inflation, it took 17 cents to travel a mile in a car in 1969. In contrast, it only takes 8.6 cents to travel a mile in a car today.  The depiction of gas prices as a rising social problem doesn’t match the cheaper cost of transportation today.  There may be other social problems associated with fossil fuel transportation, but economy is not one of them.  Unless Hi is driving an extra-large SUV and driving his fuel efficiency far below average, he should be smiling, not frowning.  Even and especially when trends seem obvious, it’s important to put them in context.
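
The per-mile figures come from dividing the price of a gallon by the miles a gallon will carry you. Here is that arithmetic as a short Python sketch, using the inflation-adjusted 1969 price from Correction 2, the mid-October 2014 price from Correction 1, and the mileage figures from Correction 3. The function name is my own, and I am assuming the 2013 mileage figure stands in for “today.”

```python
def cents_per_mile(price_per_gallon, miles_per_gallon):
    """Fuel cost of driving one mile, in cents."""
    return price_per_gallon / miles_per_gallon * 100


# 1969: 35.9 cents a gallon is $2.32 in today's dollars; 13.6 miles per gallon.
cost_1969 = cents_per_mile(2.32, 13.6)

# Today: about $3.10 a gallon in mid-October 2014; 36.0 miles per gallon (2013).
cost_today = cents_per_mile(3.10, 36.0)

print(f"1969:  {cost_1969:.1f} cents per mile")   # 17.1 cents per mile
print(f"Today: {cost_today:.1f} cents per mile")  # 8.6 cents per mile
```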
