Social Media Data Mining with Raspberry Pi: 9 Videos for the Complete Beginner

Since the start of this year, I’ve been working on a project to turn a $30 Raspberry Pi 2 computer into a social media data mining machine using the programming language Python. The words “programming language” may be off-putting, but my goal is to work through the process step-by-step so that even a complete beginner can follow along and accomplish the feat.

The inexpensive, adaptable $30 Raspberry Pi 2

I’m motivated by two impulses. My first impulse is to help people gain control over and ownership of the information regarding interaction that surrounds us. My second impulse is to demonstrate that mastery of social media information is not limited to the corporate, the government, or the otherwise well-funded sphere. This is not a video series for those who already are technologically wealthy and adept. It’s for anyone who has $30 to spare and a willingness to tinker, but who feels left out of the social media data race. I hope to make the point that anyone can use social media data mining to find out who’s talking to whom. The powers that be are already watching down at us: my hope is that we little folks can start to watch up.

I’m starting the project by shooting videos. The video series has further potential, but has proceeded far enough along to represent a fairly good arc of skill development. Eventually I’d like to transcribe the videos and create a written and illustrated how-to pamphlet; these videos are just the start.

Throughout the videos, I’ve tried not to cover up the temporary mistakes, detours and puzzling bugs that are typical of programming. No one I know of hooks up the perfect computer system or writes a perfect program on the first try. Reading error messages and sleuthing out their causes is part of the process, and you’ll see that occasionally in these videos.

Please feel free to share the videos if you find them useful. I’d also appreciate any feedback you might have to offer.

Video 1: Hardware Setup for the Raspberry Pi

Video 2: Setting up the Raspberry Pi’s Raspbian Operating System

Video 3: Using the Raspberry Pi’s Text and Graphical Operating Systems

Video 4: Installing R

Video 5: Twitter, Tweepy and Python

Video 6: Debugging

Video 7: Saving Twitter Posts in a CSV File

Video 8: Extracting and Saving Data on Twitter URLs, Hashtags, and Mentions

Video 9: Custom Input

Track Social Networks… to Find the People Tracking You

As the course designer and instructor for an undergraduate social networks course at the University of Maine at Augusta, I am often asked why students should take the course. I think there are many answers to this question. One answer comes from a humanities standpoint: learning how to represent patterns in relationships with attention to meaningful visual cues can deepen understanding of design and lead to innovation in art. Culturally speaking, networks have geek appeal as sparkling and colorful objects lending panache to infographics. If critical thinking is important to you, you might be interested in network analysis for the challenge of mastering multidimensionality and matrix mathematics; as you work at network puzzles you’ll develop your logical and quantitative reasoning ability. But these appeals aren’t all: the study of social networks can be practically useful, too.

One practical use of social network analysis is highlighted by the Disconnect extension you can add to your Chrome, Firefox, Safari, or Opera internet browser…

I should break in here. Whenever you read "extension you can add to your internet browser," you should begin to get nervous. Many add-ins, add-ons, and add-arounds to your internet browsing or Facebook or Twitter experience are so colorful and fun to play with. But they have a second purpose lurking behind the colorful and fun one: to track your movement across websites so someone can sell data about where you go and what you do. But when consulting Disconnect's privacy policy, I was pleasantly surprised to discover that the Disconnect extension collects information about you only minimally and doesn't sell information to advertisers: "Disconnect never sells your personal info.... Our browser extensions don't collect any of your personal info. Unlike most websites, our site doesn’t collect your IP address."

… so as I was saying, the Disconnect extension available for most internet browsers makes use of social network analysis to share useful information about websites that let your data leak out to third parties:

If you install the Disconnect extension in your browser, then visit a website, it will create a network graph (or “sociogram”) with that website at the center, visually linked to other websites that are given data whenever you visit that site. By bringing those network graphs together for different websites, you can figure out how your personal information might be combined and how that combination might be harmful to you.

That might sound a little abstract, so let me make it concrete. Consider the mini-industry on the internet of “Print-On-Demand” apparel. On websites like CafePress, Zazzle and Skreened, you can browse through thousands of t-shirt designs made up by people like you. If you find a design you like, you can put it on a t-shirt that fits your style, order that shirt, and have it printed up and sent specially to you. The printer gets a cut of the profits, the designer gets a cut of the profits, and you get just the shirt you want.

While these print-on-demand services are offering you a service that makes them a little money, are they harvesting your data on the sly? To find out, I activated the Disconnect extension in my browser and visited the CafePress, Zazzle and Skreened websites. Disconnect produced three sociograms, which I combine to form the network graph you see below:

How the Skreened, CafePress and Zazzle websites track your visits: February 2014

The above image is current as of February 2014, and represents a change in tracking since the last time I looked at these websites in December of 2012:

Skreened, CafePress and Zazzle website tracking technology habits: December 2012

There are a number of patterns to notice. Consistently and by a wide margin, CafePress has been sending information about you to the largest number of third-party websites. Over time, on the other hand, Skreened and (to a lesser extent) Zazzle have started to catch up, sending more information about you to other companies. Those companies include Lucky Orange (“We don’t just tell you who is on your site, we show you what they are doing”), Monetate (“helping you understand your customers’ situations, behaviors and preferences”), Retention Science (“analyze & predict customer behaviors”), and Tell Apart (“If you’ve ever clicked on an ad for a pair of shoes that seem like they were made for you, Tell Apart may very well have been responsible”).

When the practices of individual websites such as CafePress, Skreened and Zazzle are combined into a network, we can find points of overlap. CafePress and Skreened send their information to three websites in common: doubleclick.net, google-analytics.com, and googleadservices.com. Each of these services tracks users by IP address, so these data mining companies can bring together your behavior at CafePress and your behavior at Skreened to figure out aspects of your identity and preferences that might not be apparent if they had access to only one of the websites. All three websites send data to googleadservices.com, leading to even more detailed insights about you. Would you be surprised to find out that doubleclick.net also receives information about visitors from nytimes.com, foxnews.com and amazon.com? Would it surprise you to know that doubleclick.net is owned by Google, bringing this overlap into even sharper focus?

Looking at simple lists of the third-party recipients of your information on a website can give you a rough sense of how leaky an individual website is. Looking at the network overlap in recipients tells you which of those recipients are likely to be learning the most about you, constructing an increasingly accurate virtual you for sale.
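If you'd like to try this kind of overlap counting yourself, here is a minimal Python sketch. It is not how Disconnect works internally; it only shows the idea of flipping "website sends to recipient" lists around so you can see which recipients hear from the most websites. The `recipients` table holds just the domains named above (the real sociograms contain many more), and the function names `sites_per_recipient` and `rank_by_overlap` are my own inventions for illustration.

```python
from collections import defaultdict

# Partial data: only the third-party domains named in the text.
# The full sociograms list many more recipients per website.
recipients = {
    "CafePress": {"doubleclick.net", "google-analytics.com", "googleadservices.com"},
    "Skreened": {"doubleclick.net", "google-analytics.com", "googleadservices.com"},
    "Zazzle": {"googleadservices.com"},
}


def sites_per_recipient(recipients):
    """Invert website -> recipients into recipient -> set of websites feeding it."""
    inverted = defaultdict(set)
    for site, third_parties in recipients.items():
        for third_party in third_parties:
            inverted[third_party].add(site)
    return dict(inverted)


def rank_by_overlap(recipients):
    """List recipients from most to fewest contributing websites (ties sorted by name)."""
    inverted = sites_per_recipient(recipients)
    return sorted(inverted.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    for third_party, sites in rank_by_overlap(recipients):
        print(third_party, len(sites), sorted(sites))
```

The recipients at the top of the ranking are the ones positioned to combine what you do across the most websites.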

A Hashtag Contested: Positive and Negative Social Media Reaction to the RSA-NSA Scandal

For some time now, public relations professionals have been worrying about “the bashtag problem.” Corporations may spend years cultivating positive conversations about their products over social media by developing and promoting a hashtag, only to see “their” hashtag fall into bashtag status when negative social media posts about that organization swamp the positive posts the organization seeks. Upset that public criticism may “ruin their brand,” some corporations have developed intimidation strategies to shut up and shut down isolated critics. But when large numbers of people join in the bashtagging, there’s no easy way to stop the dissent.

Through the fall of 2013, cybersecurity corporation RSA enjoyed positive references on its #RSAC hashtag on Twitter that it had developed to advertise its annual professional conference. In late December, however, it emerged that RSA’s data encryption products had a “back door” built into them that allowed the National Security Agency (NSA) to break users’ encryption and (possibly without a warrant) snoop on private communications. On December 23, RSA issued a “non-denial” that seemed to implicitly acknowledge the arrangement. On that day, the positive flavor of the #RSAC hashtag changed.

After collecting the Twitter posts (or “tweets”) of the #RSAC hashtag using the Tweet Archivist Desktop, I’ve looked at the content of each one, determining whether its attitude toward RSA or the RSA Conference (RSAC) was positive, negative or neutral. The following graph tracks the volume of positivity, negativity and neutrality in the #RSAC hashtag from December 21 2013 through January 14 2014 (today):

Volume of Tweets Positive, Negative and Neutral Toward RSA in the #RSAC hashtag, 12/21/2013 to 1/14/2014
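For anyone who wants to build a similar tally, here is a small Python sketch of the counting step behind a graph like this one. It assumes you have already read each tweet and labeled it by hand, as I did; the code does not judge sentiment on its own. The `coded_tweets` rows are made-up placeholders, not the real #RSAC data, and `daily_volume` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Placeholder rows for illustration only: (date, hand-assigned label).
# These are NOT the actual #RSAC tweets or counts.
coded_tweets = [
    ("2013-12-21", "positive"),
    ("2013-12-23", "negative"),
    ("2013-12-23", "negative"),
    ("2013-12-23", "neutral"),
    ("2014-01-07", "positive"),
]


def daily_volume(coded_tweets):
    """Count how many tweets carry each label on each date."""
    counts = defaultdict(Counter)
    for date, label in coded_tweets:
        counts[date][label] += 1
    # Sort by date so the result is ready to plot in order.
    return {date: dict(labels) for date, labels in sorted(counts.items())}


if __name__ == "__main__":
    for date, labels in daily_volume(coded_tweets).items():
        print(date, labels)
```

Dates written as year-month-day sort correctly as plain text, which is why the sketch uses that format.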

After an initial burst in which some prominent conference speakers canceled their participation in protest, it appeared that negative tweets regarding the RSA Conference might abate over the end-of-year holidays, and RSA began to use the channel to promote its conference again. Then, on January 7, RSA let out a teaser of a Tweet about the identity of its keynote speaker:

RSA Tweets on January 7 2014: Click here to find out who has been announced as #RSAC closing keynote speaker for 2014

That speaker is Stephen Colbert. With a celebrity drawn into the story, public attention returned, generating a new peak of critical #RSAC tweets that seems to be continuing. Some of those tweets are original, but the bulk of them constitute just a few messages, tweeted and retweeted over and over again over the #RSAC hashtag channel. Anti-surveillance social movement organization Fight For the Future has deployed a special web page

Fight for the Future asks its followers to send out automated tweets to overwhelm the #RSAC hashtag

… on which it asks its followers to share this message on Twitter: "Surveillance is no joke! Tell @StephenAtHome to cancel his keynote at this NSA tainted conference. http://cms.fightforthefuture.org/colbert/ #RSAC"

15.4% of all Tweets on the #RSAC hashtag from December 21 2013 to January 14 2014 are this one Tweet, posted over and over. Another Fight for the Future mass tweet, "Does Stephen Colbert secretly love the NSA? There's only one way to find out: http://t.co/SAVDMFup2I #RSAC," accounts for another 2.1% of #RSAC Tweets during the period.
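A percentage like this can be computed with a few lines of Python. The sketch below is one simple way to do it, not necessarily the exact procedure I followed: it counts a tweet as a copy if the slogan's text appears anywhere inside it, which also catches retweets that begin with "RT @someone:". It matches on the wording only and leaves out the link, on the assumption that links may be shortened differently from post to post. The sample `tweets` list is invented for illustration, and `share_of_tweets` is a hypothetical helper name.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(text.lower().split())


def share_of_tweets(tweets, message):
    """Return the percentage of tweets that contain the given message text."""
    if not tweets:
        return 0.0
    target = normalize(message)
    matches = sum(1 for tweet in tweets if target in normalize(tweet))
    return 100.0 * matches / len(tweets)


# Invented sample for illustration; the last two posts are made up.
tweets = [
    "Surveillance is no joke! Tell @StephenAtHome to cancel his keynote at this NSA tainted conference. #RSAC",
    "RT @example_user: Surveillance is no joke! Tell @StephenAtHome to cancel his keynote at this NSA tainted conference. #RSAC",
    "Looking forward to the conference #RSAC",
    "Another unrelated post #RSAC",
]
message = "Surveillance is no joke! Tell @StephenAtHome to cancel his keynote"

if __name__ == "__main__":
    print(round(share_of_tweets(tweets, message), 1))
```

Substring matching is deliberately loose; if people edit the slogan before posting, those variants would be missed and the share undercounted.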

Fight for the Future is part of a coalition of anti-surveillance groups who have announced a national day of protest on February 11. It’s called “The Day We Fight Back.” Where will the fight be? On the streets? Will there be a march? A picket? A rally in some square?

Apparently not. According to press materials, all activities will be taking place on the internet, where followers will be encouraged to share graphics on their blogs, to change their profile photos on Facebook, and to chant pre-written slogans over Twitter.

In American social movements, web banners are replacing cloth banners. Marches are giving way to orchestrated internet bashtagging. Yesterday’s gone, yesterday’s gone.