Finding and Extracting Variables from Web Pages with PHP: A How-to for Social Scientists in the Rough

“Data Mining”: Just Another Way for Social Scientists to Ask Questions

If social science is the study of the structure of interactions, groups and classes, and if interactions, groups and classes are increasingly tied to the online environment, then it is increasingly important for social scientists to learn how to collect data online. Fortunately, the approach to “data mining” online interaction is fundamentally the same as the approach to studying offline social interaction:

  1. We approach the subject,
  2. We query the subject, and
  3. We obtain variables based on the responses we’re given.

Because the online environment and our online subjects are different, the way we make online queries must be different from the way we make offline queries. In data mining we don’t question human beings who can flexibly interpret a question; instead, we question computers responsible for the architecture of the online social system, and they will only respond if questioned in precisely the right way.

 

Learning to Mine the Web for Social Data — Without a Computer Science Degree

I’ve been trying to learn how to mine social information from websites on my own, without the benefit of any formal education in computer science.  This is kind of fun even when it’s frustrating, as long as I remember that getting information from the online environment is like solving a puzzle.  On most websites, social information (relations, communications, and group memberships) is stored in a database (like XMLSQL or JSON); some content management software (like WordPress, Joomla or Drupal) takes the information stored in a database and posts it on web pages, surrounded by code that makes the information comprehensible to humans like you and me.  If websites are researcher-friendly, they allow databases to be queried directly through an Application-Programming Interface (API).

Many websites don’t let a person query their databases, even when all the information published on those websites is public.  What’s a social scientist to do?  Well, we could literally read each single web page, find the information about relations, communications and group memberships we’re interested in, write down that information, and enter it into our own database for analysis.  We could do this, hypothetically, but at the practical scale of the Internet it’s often impossible.  Manually collecting interactions on a website with 10,000 participants could take years — and by the time we were done, there would be a whole new set of interactions to observe!

Fortunately, because web pages on social websites are written by computers, there are inevitably patterns in the way they’re written.  Visit a typical page on a social media website and use your browser’s “View source” command to look at the raw HTML language creating that page.  You’ll find sections that look like this:


<div class=”post” postid=”32“><div class=”comments”><a name=”comments”></a><h3>3 Comments on “Lucille’s First Blog Post”</h3><div class=”commentblock”>
<div class=”comment” id=”444“><a href=”/member.php?memberid=”201” usertitle=”Tim – click here to go to my blog”> Tim</a>: Greetings! How are you, Lucille?</div>
<div class=”comment” id=”445“><a href=”/member.php?memberid=”1181” usertitle=”Lucille – click here to go to my blog”> Lucille</a>: Hey, Tom. I’m new here. How do I respond to your comment?</div>
<div class=”comment” id=”446“><a href=”/member.php?memberid=”201” usertitle=”Tim – click here to go to my blog”> Tim</a>: Congratulations, Lucille, you just did!  Welcome to the community.</div>
</div></div></div>


That may look like a cluttered mess, but if you look carefully you can find important information.  Some of that information is the content that users write.   Other pieces of information track posts, comments and users by number or name. These names and numbers (indicated in red above) can be thought of as social science variables, and encouragingly they’re placed in predictable locations in a web page:

variable preceded by followed by
post id <div postid=” “><div >
comment id <div id=” “><a href=”/member.php?
member id member.php?memberid=” ” usertitle=”
member name  usertitle=”  – click here to go to my blog

There should be a set of rules for finding these predictable locations, and my goal in data mining is to explain those rules in a computer program that automatically reads many pages on a website, much faster than I can read them.  In English, the rules would look like this:

“Find text that is preceded by [preceding text] and is followed by [following text].  This text is an instance of [variable name].”

Unfortunately, computers don’t understand English.  I am familiar with a language called PHP that can read lines of a web page.  I didn’t know of a command in PHP that would let me carry out the rule described above.  What to do?  Ask a friend.  I asked a friend of mine with a PhD in Computer Science if he could identify such a command in PHP. His answer: “Well, you don’t want to use PHP. The first thing to do is teach yourself Perl.” The Perl programming language, he went on to explain, has much more efficient and flexible approach to handling strings as variables, and if I was going to be serious about data mining efficiently, I should use Perl.

I can’t tell you how many times some computer science expert has told me I shouldn’t follow a path because it was “inelegant” or “inefficient.”  Well, that may be wonderful advice for professional computer programmers who have to design and maintain huge information edifices, or to those who have a few extra semesters to spare in their learning quest, but in my case I say a hearty “Baloney!” to that.  Research does not need to and often cannot wait for the most efficient or elegant or masterful technique to be mastered.  Sometimes the most important thing to do is to get the darned research done.

In my case, this means that I’m going to use PHP, even though it may not be elegant or efficient or flexible or have objects to orient or [insert computer science tech phrase here].  I’m going to use PHP because I know it and it will — clumsily or not — get the darned job done.  Good enough may not be perfect but it is, by definition, good enough.  As long as the result is accurate, I can live with that.

 

A Rough but Ready Method for Extracting Variables from Web Pages with PHP — Explode!

It took a bit of reading through PHP’s online manual, but eventually I found a method that works for me — the “explode” command.  In what follows, I’m going to assume that you are familiar with PHP.  If you aren’t, that’s OK — you’ll just have to find another way to extract information out of a web page.

The PHP command “Explode” takes a string — a line of text in a web page — and splits it into parts.  “Explode” splits your line of text wherever a certain delimiter is found.  A delimiter is nothing more than a piece of text you want to use as a splitting point.  Let’s use an example, the web page snippet listed above:


<div class=”post” postid=”32″><div class=”comments”><a name=”comments”></a><h3>3 Comments on “Lucille’s First Blog Post”</h3><div class=”commentblock”>

<div class=”comment” id=”444″><a href=”/member.php?memberid=”201″ usertitle=”Tim – click here to go to my blog”> Tim</a>: Greetings! How are you, Lucille?</div>

<div class=”comment” id=”445″><a href=”/member.php?memberid=”1181″ usertitle=”Lucille – click here to go to my blog”> Lucille</a>: Hey, Tom. I’m new here. How do I respond to your comment?</div>

<div class=”comment” id=”446″><a href=”/member.php?memberid=”201″ usertitle=”Tim – click here to go to my blog”> Tim</a>: Congratulations, Lucille, you just did! Welcome to the community.</div>

</div></div></div>


Let’s say I’d like to look through 5,000 web pages like this, representing 5,000 individual blog posts.  In each of these 5,000 web pages, the particular post id and comment ids and member ids may change, but the places where they can be found and the code surrounding them remain the same.  We’ll use the code surrounding our desired information as delimiters.

To get really specific, let’s say I’d like to extract a member id number from the above web page every place it occurs.

The first step is to find a line of the web page on which a member id number exists.  To do this, I’ll use the stristr command, which searches for text. The command if (stristr($line, ‘?memberid=’)) {…} takes a look at a line of a website ($line) and asks if it contains a certain piece of text (in this case, ?memberid=).  If the piece of text is found, then what ever commands inside the brackets { } are executed.  If the piece of text is not found, then your computer won’t do anything.

So far, we have:

if (stristr($line, ‘?memberid=’))
{

}

What goes inside the brackets?  Some exploding!  Our first line of code inside the brackets tells the computer to split a line of website code using the text memberid= as the delimiter.

$cutstart = explode(‘memberid=’, $line);

This leaves a line of website code in two pieces, with the delimiter memberid= removed.  Those two pieces are set by the explode command to be $cutstart[0] and $cutstart[1]:

Original line of text: <div id=”444″><a href=”/member.php?memberid=”201″ usertitle=”Tim – click here to go to my blog”> Tim</a>: Greetings! How are you, Lucille?</div>

$cutstart[0]: <div id=”444″><a href=”/member.php?

$cutstart[1]: “201” usertitle=”Tim – click here to go to my blog”> Tim</a>: Greetings! How are you, Lucille?</div>

Where’s the member id number we want?  It’s the number right at the start of $cutstart[1], sitting in between the double quotation marks.  To get at that, let’s add another line of code to explode $cutstart[1] which tells the computer to split $cutstart[1] into pieces at the spots where there are double quotation marks.  The command in the second line of code inside the brackets is:

$cutend = explode(‘”‘, $cutstart[1]);

and takes $cutstart[1] apart, turning it into the pieces $cutend[0]$cutend[1], $cutend[2], $cutend[3] like so:

original $cutstart[1]: “201” usertitle=”Tim – click here to go to my blog”> Tim</a>: Greetings! How are you, Lucille?</div>

$cutend[0]: 201

$cutend[1]: usertitle= 

$cutend[2]: Tim – click here to go to my blog

$cutend[3]: > Tim</a>: Greetings! How are you, Lucille?</div>

Which part am I interested in?  Only the member id number, and finally that’s what I’ve got in $cutend[0].  If I want, I can rename it to help me remember what I’ve got:

$memberid = $cutend[0];

Taken all together, the code looks like this.


if (stristr($line, ‘?memberid=’))
{
$cutstart = explode(‘memberid=’, $line);
$cutend = explode(‘”‘, $cutstart[1]);
$memberid = $cutend[0];
}


This may not be the most elegant or efficient solution, but it’s pretty simple — and most importantly, gosh darn it, it works.  A novice data miner like me will never get hired away by Google for basic programming like this, and if you’re a social scientist with mad programming skills you may scoff at the elementary nature of this step.  That’s OK; this isn’t written for the Google corporation or wicked-fast coders.  I wrote all this out because the code was a big step for me in becoming a better, more complete social scientist.  If you’re looking to take the same step, I hope this post helps you along.

Credit goes to Tizag for helping me to understand the “explode” command a bit better. In turn, if you can think of a way for me to explain this more clearly or fully, please let me know by sharing a comment.

Leave a Reply

Your email address will not be published. Required fields are marked *