Social Media Data Mining with Raspberry Pi: 9 Videos for the Complete Beginner

Since the start of this year, I’ve been working on a project to turn a $30 Raspberry Pi 2 computer into a social media data mining machine using the programming language Python. The words “programming language” may be off-putting, but my goal is to work through the process step-by-step so that even a complete beginner can follow along and accomplish the feat.

The inexpensive, adaptable $30 Raspberry Pi 2

I’m motivated by two impulses. My first impulse is to help people gain control over and ownership of the information regarding interaction that surrounds us. My second impulse is to demonstrate that mastery of social media information is not limited to the corporate, the government, or the otherwise well-funded sphere. This is not a video series for those who already are technologically wealthy and adept. It’s for anyone who has $30 to spare and a willingness to tinker, but who feels left out of the social media data race. I hope to make the point that anyone can use social media data mining to find out who’s talking to whom. The powers that be are already watching down at us: my hope is that we little folks can start to watch up.

I’m starting the project by shooting videos. The video series has further potential, but has proceeded far enough along to represent a fairly good arc of skill development. Eventually I’d like to transcribe the videos and create a written and illustrated how-to pamphlet; these videos are just the start.

Throughout the videos, I’ve tried not to cover up the temporary mistakes, detours and puzzling bugs that are typical of programming. No one I know of hooks up the perfect computer system or writes a perfect program on the first try. Working through error messages and sleuthing through them is part of the process, and you’ll see that occasionally in these videos.

Please feel free to share the videos if you find them useful. I’d also appreciate any feedback you might have to offer.

Video 1: Hardware Setup for the Raspberry Pi

Video 2: Setting up the Raspberry Pi’s Raspbian Operating System

Video 3: Using the Raspberry Pi’s Text and Graphical Operating Systems

Video 4: Installing R

Video 5: Twitter, Tweepy and Python

Video 6: Debugging

Video 7: Saving Twitter Posts in a CSV File

Video 8: Extracting and Saving Data on Twitter URLs, Hashtags, and Mentions

Video 9: Custom Input

Finding and Extracting Variables from Web Pages with PHP: A How-to for Social Scientists in the Rough

“Data Mining”: Just Another Way for Social Scientists to Ask Questions

If social science is the study of the structure of interactions, groups and classes, and if interactions, groups and classes are increasingly tied to the online environment, then it is increasingly important for social scientists to learn how to collect data online. Fortunately, the approach to “data mining” online interaction is fundamentally the same as the approach to studying offline social interaction:

  1. We approach the subject,
  2. We query the subject, and
  3. We obtain variables based on the responses we’re given.

Because the online environment and our online subjects are different, the way we make online queries must be different from the way we make offline queries. In data mining we don’t question human beings who can flexibly interpret a question; instead, we question computers responsible for the architecture of the online social system, and they will only respond if questioned in precisely the right way.


Learning to Mine the Web for Social Data — Without a Computer Science Degree

I’ve been trying to learn how to mine social information from websites on my own, without the benefit of any formal education in computer science. This is kind of fun even when it’s frustrating, as long as I remember that getting information from the online environment is like solving a puzzle. On most websites, social information (relations, communications, and group memberships) is stored in a database (like XML, SQL or JSON); some content management software (like WordPress, Joomla or Drupal) takes the information stored in a database and posts it on web pages, surrounded by code that makes the information comprehensible to humans like you and me. If websites are researcher-friendly, they allow databases to be queried directly through an Application Programming Interface (API).

Many websites don’t let a person query their databases, even when all the information published on those websites is public.  What’s a social scientist to do?  Well, we could literally read each single web page, find the information about relations, communications and group memberships we’re interested in, write down that information, and enter it into our own database for analysis.  We could do this, hypothetically, but at the practical scale of the Internet it’s often impossible.  Manually collecting interactions on a website with 10,000 participants could take years — and by the time we were done, there would be a whole new set of interactions to observe!

Fortunately, because web pages on social websites are written by computers, there are inevitably patterns in the way they’re written.  Visit a typical page on a social media website and use your browser’s “View source” command to look at the raw HTML language creating that page.  You’ll find sections that look like this:

<div class="post" postid="32"><div class="comments"><a name="comments"></a><h3>3 Comments on “Lucille’s First Blog Post”</h3><div class="commentblock">
<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>
<div class="comment" id="445"><a href="/member.php?memberid="1181" usertitle="Lucille – click here to go to my blog"> Lucille</a>: Hey, Tom. I’m new here. How do I respond to your comment?</div>
<div class="comment" id="446"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Congratulations, Lucille, you just did!  Welcome to the community.</div>

That may look like a cluttered mess, but if you look carefully you can find important information. Some of that information is the content that users write. Other pieces of information track posts, comments and users by number or name. These names and numbers (values such as 32, 444, 201 and Tim in the snippet above) can be thought of as social science variables, and encouragingly they’re placed in predictable locations in a web page:

variable       preceded by                  followed by
post id        postid="                     "><div
comment id     <div class="comment" id="    "><a href="/member.php?
member id      member.php?memberid="        " usertitle="
member name    usertitle="                   – click here to go to my blog

There should be a set of rules for finding these predictable locations, and my goal in data mining is to explain those rules in a computer program that automatically reads many pages on a website, much faster than I can read them.  In English, the rules would look like this:

“Find text that is preceded by [preceding text] and is followed by [following text].  This text is an instance of [variable name].”

Unfortunately, computers don’t understand English. I am familiar with a language called PHP that can read lines of a web page. I didn’t know of a command in PHP that would let me carry out the rule described above. What to do? Ask a friend. I asked a friend of mine with a PhD in Computer Science if he could identify such a command in PHP. His answer: “Well, you don’t want to use PHP. The first thing to do is teach yourself Perl.” The Perl programming language, he went on to explain, has a much more efficient and flexible approach to handling strings as variables, and if I was going to be serious about data mining efficiently, I should use Perl.

I can’t tell you how many times some computer science expert has told me I shouldn’t follow a path because it was “inelegant” or “inefficient.”  Well, that may be wonderful advice for professional computer programmers who have to design and maintain huge information edifices, or to those who have a few extra semesters to spare in their learning quest, but in my case I say a hearty “Baloney!” to that.  Research does not need to and often cannot wait for the most efficient or elegant or masterful technique to be mastered.  Sometimes the most important thing to do is to get the darned research done.

In my case, this means that I’m going to use PHP, even though it may not be elegant or efficient or flexible or have objects to orient or [insert computer science tech phrase here].  I’m going to use PHP because I know it and it will — clumsily or not — get the darned job done.  Good enough may not be perfect but it is, by definition, good enough.  As long as the result is accurate, I can live with that.


A Rough but Ready Method for Extracting Variables from Web Pages with PHP — Explode!

It took a bit of reading through PHP’s online manual, but eventually I found a method that works for me — the “explode” command.  In what follows, I’m going to assume that you are familiar with PHP.  If you aren’t, that’s OK — you’ll just have to find another way to extract information out of a web page.

The PHP command “Explode” takes a string — a line of text in a web page — and splits it into parts.  “Explode” splits your line of text wherever a certain delimiter is found.  A delimiter is nothing more than a piece of text you want to use as a splitting point.  Let’s use an example, the web page snippet listed above:

<div class="post" postid="32"><div class="comments"><a name="comments"></a><h3>3 Comments on “Lucille’s First Blog Post”</h3><div class="commentblock">

<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>

<div class="comment" id="445"><a href="/member.php?memberid="1181" usertitle="Lucille – click here to go to my blog"> Lucille</a>: Hey, Tom. I’m new here. How do I respond to your comment?</div>

<div class="comment" id="446"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Congratulations, Lucille, you just did! Welcome to the community.</div>


Let’s say I’d like to look through 5,000 web pages like this, representing 5,000 individual blog posts.  In each of these 5,000 web pages, the particular post id and comment ids and member ids may change, but the places where they can be found and the code surrounding them remain the same.  We’ll use the code surrounding our desired information as delimiters.

To get really specific, let’s say I’d like to extract a member id number from the above web page every place it occurs.

The first step is to find a line of the web page on which a member id number exists. To do this, I’ll use the stristr command, which searches for text without regard to upper or lower case. The command if (stristr($line, '?memberid=')) {…} takes a look at a line of a web page ($line) and asks if it contains a certain piece of text (in this case, ?memberid=). If the piece of text is found, then whatever commands sit inside the brackets { } are executed. If the piece of text is not found, then your computer won’t do anything.

So far, we have:

if (stristr($line, '?memberid=')) { }


What goes inside the brackets?  Some exploding!  Our first line of code inside the brackets tells the computer to split a line of website code using the text memberid= as the delimiter.

$cutstart = explode('memberid=', $line);

This leaves a line of website code in two pieces, with the delimiter memberid= removed.  Those two pieces are set by the explode command to be $cutstart[0] and $cutstart[1]:

Original line of text: <div id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>

$cutstart[0]: <div id="444"><a href="/member.php?

$cutstart[1]: "201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>
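
The split can be checked with a few lines of PHP. This is only a minimal sketch: it assumes the sample line has been pasted into a string by hand, with the straight quotation marks that appear in raw HTML, rather than read from a live web page.

```php
<?php
// Sample line copied from the example page above (not fetched from a real site).
$line = '<div id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>';

// Split the line wherever the delimiter memberid= occurs.
$cutstart = explode('memberid=', $line);

echo count($cutstart), "\n"; // how many pieces the line was split into
echo $cutstart[0], "\n";     // everything before the delimiter
echo $cutstart[1], "\n";     // everything after the delimiter
```

Because the delimiter occurs once in this line, the sketch produces two pieces, and the delimiter itself appears in neither of them.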

Where’s the member id number we want?  It’s the number right at the start of $cutstart[1], sitting in between the double quotation marks.  To get at that, let’s add another line of code to explode $cutstart[1] which tells the computer to split $cutstart[1] into pieces at the spots where there are double quotation marks.  The command in the second line of code inside the brackets is:

$cutend = explode('"', $cutstart[1]);

and takes $cutstart[1] apart, turning it into the pieces $cutend[0], $cutend[1], $cutend[2], $cutend[3] and $cutend[4] like so:

original $cutstart[1]: "201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>

$cutend[0]: (empty, because $cutstart[1] begins with a quotation mark and nothing comes before it)

$cutend[1]: 201

$cutend[2]:  usertitle=

$cutend[3]: Tim – click here to go to my blog

$cutend[4]: > Tim</a>: Greetings! How are you, Lucille?</div>

Which part am I interested in?  Only the member id number, and finally that’s what I’ve got in $cutend[1].  If I want, I can rename it to help me remember what I’ve got:

$memberid = $cutend[1];

Taken all together, the code looks like this.

if (stristr($line, '?memberid=')) {
    $cutstart = explode('memberid=', $line);
    $cutend = explode('"', $cutstart[1]);
    $memberid = $cutend[1];
}

This may not be the most elegant or efficient solution, but it’s pretty simple — and most importantly, gosh darn it, it works.  A novice data miner like me will never get hired away by Google for basic programming like this, and if you’re a social scientist with mad programming skills you may scoff at the elementary nature of this step.  That’s OK; this isn’t written for the Google corporation or wicked-fast coders.  I wrote all this out because the code was a big step for me in becoming a better, more complete social scientist.  If you’re looking to take the same step, I hope this post helps you along.
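
For readers who want to apply the same idea to more than one variable, here is a rough sketch of the general rule “find text that is preceded by [preceding text] and is followed by [following text].” The function name extractBetween and the $page array are my own illustrations, not part of PHP. The sample lines are typed in by hand; a real script would read them from downloaded pages. The sketch puts the opening quotation mark inside the preceding text, so the wanted value is the first piece after the second split.

```php
<?php
// Hypothetical helper: returns every piece of text in $line that sits
// between $before and $after, using only explode().
function extractBetween(string $line, string $before, string $after): array
{
    $found = [];
    $pieces = explode($before, $line);
    // $pieces[0] is whatever came before the first match, so skip it.
    foreach (array_slice($pieces, 1) as $piece) {
        $parts = explode($after, $piece);
        // Only keep the value if the following text was really present.
        if (count($parts) > 1) {
            $found[] = $parts[0];
        }
    }
    return $found;
}

// Hand-typed sample lines modeled on the example page (an assumption).
$page = [
    '<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>',
    '<div class="comment" id="445"><a href="/member.php?memberid="1181" usertitle="Lucille – click here to go to my blog"> Lucille</a>: Hey, Tom. I’m new here.</div>',
    '<div class="comment" id="446"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Congratulations, Lucille, you just did!</div>',
];

$memberids = [];
$membernames = [];
foreach ($page as $line) {
    // Only bother with lines that mention a member id at all.
    if (stristr($line, '?memberid=')) {
        $memberids = array_merge($memberids, extractBetween($line, 'memberid="', '"'));
        $membernames = array_merge($membernames, extractBetween($line, 'usertitle="', ' – click here'));
    }
}

print_r($memberids);
print_r($membernames);
```

Run against the three sample lines, the sketch collects the member ids 201, 1181 and 201 along with the names Tim, Lucille and Tim. The preceding and following text for other variables would come from the table of predictable locations earlier in this post.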

Credit goes to Tizag for helping me to understand the “explode” command a bit better. In turn, if you can think of a way for me to explain this more clearly or fully, please let me know by sharing a comment.

YouTube, Socially HalfBaked

In undergraduate courses, I often exhort students to express their ideas in measurable terms and to make sure that what they think they’re measuring and what they’re actually measuring have a reasonable connection.  This could be seen as the worry of a fussy academic, but there are real consequences to fuzzy thinking and fuzzy measurement in what some people call “the real world.”  I recently came across a “real-world” example of fuzzy research in the field of social media analytics that I’d like to share with you.   As this example shows, the use of trendy and colorful infographics can’t always bridge an information gap.

Thinking about YouTube: All Views? Views Per Video? Average Video Length?

On November 27, 2013, the social media analytics company SocialBakers released a report in which it confidently declared that “Videos Under Two Minutes Generate the Most YouTube Views.” This is an ambiguous claim with at least two possible meanings:

Possible Meaning #1: If we count all YouTube views, most of the views will be of videos under two minutes long.
Possible Meaning #2: A video of less than two minutes in duration will tend to obtain more views than a longer video.

These possible meanings may sound similar, but they are quite different in substance. Meaning #1 brings to mind the saying that “most car crashes happen within a mile of home.” This may be true, but that fact doesn’t imply that driving close to home is more dangerous, because we also do most of our driving within a mile of home. In the same vein, it might be that most video views are for videos that are under two minutes long, but if most videos are under two minutes long, that’s not at all surprising.

What we really want to know if we’re driving is what locations are more risky. For every mile we drive closer to home, are we more or less likely to crash? If we’re posting YouTube videos with the hope of obtaining views, what we really want to know is whether a single short video tends to snag more views than a single medium-length video or a single extended-length video. That question is expressed in Meaning #2.

It appears from the following text that SocialBakers is interested in testing the question expressed in Meaning #2:

“Using YouTube to reach your Fan’s can be a tricky proposition. Done right, and you can create something that your audience will remember for a long time after, and will want to share with their friends. Videos have the potential to really go viral. But how long should a video be? Make it too long, and people will be yawning and looking for something more interesting to occupy their time. Make it too short, and you might risk your content being easily forgettable and your message undelivered. We did some data investigation to get to the bottom of what video length, on YouTube, will makes the biggest impact….”

Sounds straightforward, doesn’t it? But watch as SocialBakers nimbly shifts back to Meaning #1:

“To do this, we looked at the 300 most viewed channels among different industries. The first thing we noticed is that videos between 16 seconds to 120 seconds generate almost 50% of all views on YouTube. The most successful videos are almost unanimously below 2 minutes in length.”

Did you notice the shift? In the second sentence from that passage, they’re measuring the number of views for all videos and comparing it to the number of views for all videos between 16 and 120 seconds. The problem is that there may just be a whole lot of videos between 16 and 120 seconds long — if so, it’s no wonder that they account for all those views. What we need to know to figure out whether this information is useful is another piece of information: what percent of YouTube videos are between 16 seconds and 120 seconds long. If such videos make up 70% of YouTube videos, then it’s not at all impressive that they generate 50% of all views. In fact, that result would be underwhelming. If, on the other hand, such videos make up just 20% of YouTube videos, then it would be quite impressive for them to garner 50% of all views.
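
The comparison can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name viewsRelativeToShare is made up, and the 70% and 20% figures are the hypothetical scenarios from the paragraph above, not measurements. It divides a category’s share of views by its share of videos; a result above 1 means the category draws more views than its numbers alone would predict.

```php
<?php
// Hypothetical helper: compares a category's share of all views with its
// share of all videos. Above 1 = more views than its size alone predicts.
function viewsRelativeToShare(float $shareOfViews, float $shareOfVideos): float
{
    return $shareOfViews / $shareOfVideos;
}

// Scenario A: 50% of views come from 70% of videos (hypothetical).
$underwhelming = viewsRelativeToShare(0.50, 0.70);

// Scenario B: 50% of views come from 20% of videos (hypothetical).
$impressive = viewsRelativeToShare(0.50, 0.20);

printf("Scenario A: %.2f\n", $underwhelming);
printf("Scenario B: %.2f\n", $impressive);
```

In Scenario A the ratio falls below 1 (about 0.71), so short videos would be underperforming; in Scenario B it is 2.5, which would be impressive. Without knowing the share of videos, the 50% figure alone cannot tell us which scenario we are in.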

Well, what does SocialBakers actually measure? To figure this out, let’s look at the company’s slickly-produced infographics from its brief report:

SocialBakers: Videos under two minutes generate the most YouTube views

This infographic doesn’t clarify matters at all. The numbers reported are percentages, but what are they percentages of? If you look closely, you’ll notice the large-text title implies that the percentages in the graphic are percentages of views (“generate the most YouTube views”). On the other hand, the tiny text underneath the graphic tells us that what SocialBakers has calculated is the “average length of YouTube videos,” not the share of views generated by YouTube videos.

SocialBakers’ second infographic makes it clear what’s going on. Take a close look at the numbers listed below, which are labeled “Lengths of YouTube Videos”:

SocialBakers: Common Lengths of YouTube Videos

All of the counts at the top of each bar add up to 579,112 videos. Those must be counts of videos, not counts of views, because just one recent video from the top channel, PewDiePie, has gained nearly 2 million views. The number of videos of 0-15 seconds (50,505) is 8.8% of 579,112. The number of videos of 16-30 seconds (90,619) is 15.6% of 579,112. The second infographic confirms for us that the first infographic is measuring the commonality of videos of different lengths, not the share of views obtained by videos of various lengths. Those two different-looking infographics are really just sharing the same information in different layouts.

SocialBakers’ infographics don’t tell us whether a long video tends to obtain more views than a short video, because the infographics don’t measure the number of views per video. Those infographics don’t describe views at all (and there is no more data described in SocialBakers’ report to make up for this lack). Regardless, SocialBakers concludes that “Everyone Loves Short and Sweet Videos,” that “it is often far more effective to take up a small amount of viewing bandwidth in order to keep your audience entertained,” and that “you usually can’t go wrong by making sure your video is short and sweet.” Let’s not forget the title of SocialBakers’ report: “Videos Under Two Minutes Generate the Most YouTube Views.”

Check That Data… If You Can

SocialBakers’ conclusions in the headline and text of its report don’t follow at all from the information SocialBakers has presented, but the uncomfortable truth is that most people will nod their heads and accept those conclusions anyway. If video producers follow SocialBakers’ recommendations on the basis of this report alone, they do so at their peril. If you are a consumer of social media advice, it is wise for you to be in the minority who check out claims.

A more thorough way to check out claims would be to replicate SocialBakers’ study. In order to carry out a replication, however, we would need to know what SocialBakers actually did in its study. SocialBakers shares some information in its infographics: we know from those graphics, for instance, that SocialBakers studied videos in the date range of July 1 to September 23, 2013. But did it study all new videos introduced during that period? All existing videos introduced during that period? Some other quantity entirely? We don’t know. We’re also unclear about how many videos SocialBakers measured; was it “videos from the top 300 most viewed brand channels across different industries” (infographic #2) or “videos from a sample of the top 300 most viewed brand channels” (infographic #1)? What kind of sample? What industries were selected and by what standard? Since we don’t know these details, we can’t replicate SocialBakers’ study to directly test its claims. This is probably not a mistake. If SocialBakers told you exactly how to replicate its work, after all, it would be releasing a proprietary business secret. Social media consulting as a business thrives on some secrecy, unlike social research as an academic pursuit, which thrives on the sharing of technique.

What we’ll have to settle for is a more indirect replication. This indirect replication starts with SocialBakers’ central claim for video producers: that a short video will gather more views than a long video. SocialBakers has a stable of 230 employees that it can muster. As a single busy individual, I’ll have to look at YouTube videos on a more modest scale. I can take a fairly good look nonetheless: to follow the spirit of SocialBakers’ notion, I looked at the 10 YouTube channels with the most subscribers on November 30, 2013:

1. Spotlight
2. PewDiePie
3. Smosh
4. HolaSoyGerman
5. JennaMarbles
6. RihannaVEVO
7. nigahiga
8. RayWilliamJohnson
9. OneDirectionVEVO
10. Machinima

I’ve gathered information on the length of, and number of views of, the ten most recent videos from each channel, resulting in 100 videos. This is an admittedly small set compared to that obtained by SocialBakers, but it has two advantages. First, these are the most recent successful videos by the most successful channels on YouTube, so if we are interested in emulating success, this is where we ought to look. Second, the procedure by which I obtained these measurements is “transparent,” meaning that I’ve told you exactly how it’s done. If you don’t believe my results, you can replicate my work to show me I’m wrong.

Let’s look at the results I obtained in three ways. First, we’ll look at the simple number of videos of various lengths. Because there are 100 total videos, these counts can also be read as percentages:

Number of Videos of Various Lengths (source: 10 most recent videos from each of the 10 most-subscribed YouTube video channels)

The results here are quite striking: the most common video length is not between 31 seconds and a minute, as reported in SocialBakers’ chart, but rather between 5 and 10 minutes. The ten most successful YouTube channels produce relatively lengthy videos, not short ones: only 5 out of their most recent 100 videos are of a minute or less in length, and only 9 out of the most recent 100 videos run for two minutes or less.

Second, let’s look at the raw number of views of these 100 videos:

Number of Video Views in Ranges of Different Video Lengths for the 10 most recent videos of the 10 most popular YouTube Channels

With over 1.1 billion video views, the videos between 3 minutes and 10 minutes in length clearly have the most views. However, from our first chart above we also know that videos between 3 minutes and 10 minutes in length account for the largest number of videos (72 out of 100 of them). Is the dominant presence of video views in this range due simply to the number of videos in the range? To find out, we can divide the total number of views in each length category by the total number of videos in a category. The result is the average number of views per video in a category, graphed below:

Average Number of Views per Video, by Length of Video, YouTube November 2013
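
The views-per-video calculation itself is simple enough to sketch in a few lines of PHP. The numbers below are made up purely to show the arithmetic; they are not the measurements behind the charts, and the three length categories are simplified assumptions rather than the ones used in the graphs.

```php
<?php
// Made-up example data: [length in seconds, views]. These are NOT the
// measurements described in this post; they only illustrate the arithmetic.
$videos = [
    [45, 200000],
    [110, 400000],
    [200, 3000000],
    [280, 5000000],
    [420, 2000000],
    [540, 4000000],
    [590, 3000000],
];

// Hypothetical length categories: label => upper bound in seconds (inclusive).
$categories = [
    '2 minutes or less' => 120,
    '2 to 5 minutes'    => 300,
    '5 to 10 minutes'   => 600,
];

// Count videos and add up views within each category.
$totals = [];
foreach ($videos as [$seconds, $views]) {
    foreach ($categories as $label => $maxSeconds) {
        if ($seconds <= $maxSeconds) {
            $totals[$label]['videos'] = ($totals[$label]['videos'] ?? 0) + 1;
            $totals[$label]['views']  = ($totals[$label]['views'] ?? 0) + $views;
            break; // each video belongs to the first category it fits
        }
    }
}

// Divide total views by number of videos to get views per video.
$averages = [];
foreach ($totals as $label => $t) {
    $averages[$label] = $t['views'] / $t['videos'];
    echo $label, ': ', number_format($averages[$label]), " views per video\n";
}
```

In this made-up data the 5 to 10 minute category has the most total views simply because it holds the most videos, while the 2 to 5 minute category has the highest average per video. That is exactly the distinction between raw view counts and views per video.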

Finally we can arrive at an answer to the question posed by SocialBakers: if we believe that the ten most popular video channels provide a model to emulate, and if we believe the length of a video is what drives people to view a video or not, then video producers seeking viewers would be well advised to upload videos of between 3 and 5 minutes in length. The next most advisable length for a video would be somewhere in the range of 5 to 10 minutes. Compared to the longer videos from these popular producers, videos of two minutes or less appear to be among the least popular on YouTube, not the most popular.

Keep Asking Questions

At this point, you may have more questions than answers. For instance, are the ten most popular video channels really the model to emulate? Could they have advantages that middle-range producers can’t touch? And is it possible that the length of a video isn’t what leads people to watch, but some other feature of a video that might itself be associated with length? To answer these questions, we’d need (yes) more research. But in order to get to this second tier of questions, we need to answer our first question — and that in turn means our measurements must be able to answer our question, and that we need to be specific in describing how our measurements are made.