Finding and Extracting Variables from Web Pages with PHP: A How-to for Social Scientists in the Rough

“Data Mining”: Just Another Way for Social Scientists to Ask Questions

If social science is the study of the structure of interactions, groups and classes, and if interactions, groups and classes are increasingly tied to the online environment, then it is increasingly important for social scientists to learn how to collect data online. Fortunately, the approach to “data mining” online interaction is fundamentally the same as the approach to studying offline social interaction:

  1. We approach the subject,
  2. We query the subject, and
  3. We obtain variables based on the responses we’re given.

Because the online environment and our online subjects are different, the way we make online queries must be different from the way we make offline queries. In data mining we don’t question human beings who can flexibly interpret a question; instead, we question computers responsible for the architecture of the online social system, and they will only respond if questioned in precisely the right way.

 

Learning to Mine the Web for Social Data — Without a Computer Science Degree

I’ve been trying to learn how to mine social information from websites on my own, without the benefit of any formal education in computer science.  This is kind of fun even when it’s frustrating, as long as I remember that getting information from the online environment is like solving a puzzle.  On most websites, social information (relations, communications, and group memberships) is stored in a database (like XML, SQL or JSON); some content management software (like WordPress, Joomla or Drupal) takes the information stored in a database and posts it on web pages, surrounded by code that makes the information comprehensible to humans like you and me.  If websites are researcher-friendly, they allow databases to be queried directly through an Application Programming Interface (API).

Many websites don’t let a person query their databases, even when all the information published on those websites is public.  What’s a social scientist to do?  Well, we could literally read each single web page, find the information about relations, communications and group memberships we’re interested in, write down that information, and enter it into our own database for analysis.  We could do this, hypothetically, but at the practical scale of the Internet it’s often impossible.  Manually collecting interactions on a website with 10,000 participants could take years — and by the time we were done, there would be a whole new set of interactions to observe!

Fortunately, because web pages on social websites are written by computers, there are inevitably patterns in the way they’re written.  Visit a typical page on a social media website and use your browser’s “View source” command to look at the raw HTML language creating that page.  You’ll find sections that look like this:


<div class="post" postid="32"><div class="comments"><a name="comments"></a><h3>3 Comments on “Lucille’s First Blog Post”</h3><div class="commentblock">
<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>
<div class="comment" id="445"><a href="/member.php?memberid="1181" usertitle="Lucille – click here to go to my blog"> Lucille</a>: Hey, Tom. I’m new here. How do I respond to your comment?</div>
<div class="comment" id="446"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Congratulations, Lucille, you just did!  Welcome to the community.</div>
</div></div></div>


That may look like a cluttered mess, but if you look carefully you can find important information.  Some of that information is the content that users write.   Other pieces of information track posts, comments and users by number or name. These names and numbers (such as 32, 444, 201 and Tim in the snippet above) can be thought of as social science variables, and encouragingly they’re placed in predictable locations in a web page:

variable | preceded by | followed by
post id | postid=" | "><div class="comments">
comment id | <div class="comment" id=" | "><a href="/member.php?
member id | member.php?memberid=" | " usertitle="
member name | usertitle=" | – click here to go to my blog

There should be a set of rules for finding these predictable locations, and my goal in data mining is to explain those rules in a computer program that automatically reads many pages on a website, much faster than I can read them.  In English, the rules would look like this:

“Find text that is preceded by [preceding text] and is followed by [following text].  This text is an instance of [variable name].”

Unfortunately, computers don’t understand English.  I am familiar with a language called PHP that can read lines of a web page.  I didn’t know of a command in PHP that would let me carry out the rule described above.  What to do?  Ask a friend.  I asked a friend of mine with a PhD in Computer Science if he could identify such a command in PHP. His answer: “Well, you don’t want to use PHP. The first thing to do is teach yourself Perl.” The Perl programming language, he went on to explain, has a much more efficient and flexible approach to handling strings as variables, and if I was going to be serious about data mining efficiently, I should use Perl.

I can’t tell you how many times some computer science expert has told me I shouldn’t follow a path because it was “inelegant” or “inefficient.”  Well, that may be wonderful advice for professional computer programmers who have to design and maintain huge information edifices, or for those who have a few extra semesters to spare in their learning quest, but in my case I say a hearty “Baloney!” to that.  Research does not need to and often cannot wait for the most efficient or elegant or masterful technique to be mastered.  Sometimes the most important thing to do is to get the darned research done.

In my case, this means that I’m going to use PHP, even though it may not be elegant or efficient or flexible or have objects to orient or [insert computer science tech phrase here].  I’m going to use PHP because I know it and it will — clumsily or not — get the darned job done.  Good enough may not be perfect but it is, by definition, good enough.  As long as the result is accurate, I can live with that.

 

A Rough but Ready Method for Extracting Variables from Web Pages with PHP — Explode!

It took a bit of reading through PHP’s online manual, but eventually I found a method that works for me — the “explode” command.  In what follows, I’m going to assume that you are familiar with PHP.  If you aren’t, that’s OK — you’ll just have to find another way to extract information out of a web page.

The PHP command “Explode” takes a string — a line of text in a web page — and splits it into parts.  “Explode” splits your line of text wherever a certain delimiter is found.  A delimiter is nothing more than a piece of text you want to use as a splitting point.  Let’s use an example, the web page snippet listed above:


<div class="post" postid="32"><div class="comments"><a name="comments"></a><h3>3 Comments on “Lucille’s First Blog Post”</h3><div class="commentblock">
<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>
<div class="comment" id="445"><a href="/member.php?memberid="1181" usertitle="Lucille – click here to go to my blog"> Lucille</a>: Hey, Tom. I’m new here. How do I respond to your comment?</div>
<div class="comment" id="446"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Congratulations, Lucille, you just did! Welcome to the community.</div>
</div></div></div>


Let’s say I’d like to look through 5,000 web pages like this, representing 5,000 individual blog posts.  In each of these 5,000 web pages, the particular post id and comment ids and member ids may change, but the places where they can be found and the code surrounding them remain the same.  We’ll use the code surrounding our desired information as delimiters.

To get really specific, let’s say I’d like to extract a member id number from the above web page every place it occurs.

The first step is to find a line of the web page on which a member id number exists.  To do this, I’ll use the stristr command, which searches for text without regard to upper or lower case. The command if (stristr($line, '?memberid=')) {…} takes a look at a line of a website ($line) and asks if it contains a certain piece of text (in this case, ?memberid=).  If the piece of text is found, then whatever commands are inside the brackets { } are executed.  If the piece of text is not found, then your computer won’t do anything.

So far, we have:

if (stristr($line, '?memberid='))
{

}
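For readers who want to see what that test does before filling in the brackets, here is a minimal sketch. The two sample lines are lifted from the snippet above; the variable names are my own inventions for illustration, not anything PHP requires. stristr() returns the rest of the string from the first match onward, or false if there is no match, which is why it works inside an if.

```php
<?php
// Two sample lines from the snippet above (variable names are illustrative).
$with_member = '<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>';
$without_member = '</div></div></div>';

// stristr() returns the remainder of the string starting at the match,
// or false when the text is not found.
$found_in_first  = stristr($with_member, '?memberid=') !== false;
$found_in_second = stristr($without_member, '?memberid=') !== false;

// The search ignores case, so an upper-case needle matches too.
$found_ignoring_case = stristr($with_member, '?MEMBERID=') !== false;

var_dump($found_in_first, $found_in_second, $found_ignoring_case);
```

Comparing the result against false is a slightly stricter version of the plain if test; for the text we are searching for here, both behave the same way.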

What goes inside the brackets?  Some exploding!  Our first line of code inside the brackets tells the computer to split a line of website code using the text memberid= as the delimiter.

$cutstart = explode('memberid=', $line);

This leaves a line of website code in two pieces, with the delimiter memberid= removed.  Those two pieces are set by the explode command to be $cutstart[0] and $cutstart[1]:

Original line of text: <div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>

$cutstart[0]: <div class="comment" id="444"><a href="/member.php?

$cutstart[1]: "201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>
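Here is the same cut as a runnable sketch, in case you’d like to check the pieces yourself. $line is hard-coded to one comment line from the sample page; in real use it would come from whatever page you are reading. One thing to keep in mind: explode() splits at every occurrence of the delimiter, so a line containing memberid= twice would come back in three pieces, not two.

```php
<?php
// One comment line from the sample page, hard-coded for illustration.
$line = '<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>';

// Split the line wherever the delimiter appears; the delimiter itself is dropped.
$cutstart = explode('memberid=', $line);

echo count($cutstart), " pieces\n";
echo $cutstart[0], "\n"; // everything before memberid=
echo $cutstart[1], "\n"; // everything after memberid=
```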

Where’s the member id number we want?  It’s the number right at the start of $cutstart[1], sitting in between the double quotation marks.  To get at that, let’s add another line of code to explode $cutstart[1] which tells the computer to split $cutstart[1] into pieces at the spots where there are double quotation marks.  The command in the second line of code inside the brackets is:

$cutend = explode('"', $cutstart[1]);

and takes $cutstart[1] apart, turning it into the pieces $cutend[0], $cutend[1], $cutend[2], $cutend[3] and $cutend[4] like so:

original $cutstart[1]: "201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>

$cutend[0]: (empty, because nothing comes before the first quotation mark)

$cutend[1]: 201

$cutend[2]: usertitle=

$cutend[3]: Tim – click here to go to my blog

$cutend[4]: > Tim</a>: Greetings! How are you, Lucille?</div>

Which part am I interested in?  Only the member id number, and finally that’s what I’ve got in $cutend[1].  (Because $cutstart[1] begins with a quotation mark, the very first piece, $cutend[0], is empty.)  If I want, I can rename it to help me remember what I’ve got:

$memberid = $cutend[1];

Taken all together, the code looks like this.


if (stristr($line, '?memberid='))
{
    // Cut the line in two at memberid=
    $cutstart = explode('memberid=', $line);
    // Cut the second half at every double quotation mark
    $cutend = explode('"', $cutstart[1]);
    // Piece 0 is empty (the half starts with a quotation mark); piece 1 is the id
    $memberid = $cutend[1];
}


This may not be the most elegant or efficient solution, but it’s pretty simple — and most importantly, gosh darn it, it works.  A novice data miner like me will never get hired away by Google for basic programming like this, and if you’re a social scientist with mad programming skills you may scoff at the elementary nature of this step.  That’s OK; this isn’t written for the Google corporation or wicked-fast coders.  I wrote all this out because the code was a big step for me in becoming a better, more complete social scientist.  If you’re looking to take the same step, I hope this post helps you along.
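If you’d like to reuse the trick for other variables, the English rule from earlier (“find text that is preceded by X and followed by Y”) can be wrapped in a small function. What follows is a rough sketch under my own assumptions, not the only way to do it: the function name extract_between and the $page variable are made up for illustration, and the page text is hard-coded so the example runs on its own. It also folds the opening quotation mark into the “preceded by” text, so the value we want lands at the very start of each piece.

```php
<?php
// Hypothetical helper: returns every piece of $text that sits between
// $before and the next occurrence of $after.
function extract_between(string $text, string $before, string $after): array
{
    $values = [];
    $pieces = explode($before, $text);
    // $pieces[0] is whatever came ahead of the first $before, so skip it.
    foreach (array_slice($pieces, 1) as $piece) {
        $cutend = explode($after, $piece);
        // Keep the value only if $after really did follow it.
        if (count($cutend) > 1) {
            $values[] = $cutend[0];
        }
    }
    return $values;
}

// The sample page, hard-coded so the sketch is self-contained.
$page = <<<'HTML'
<div class="post" postid="32"><div class="comments"><a name="comments"></a><h3>3 Comments on “Lucille’s First Blog Post”</h3><div class="commentblock">
<div class="comment" id="444"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Greetings! How are you, Lucille?</div>
<div class="comment" id="445"><a href="/member.php?memberid="1181" usertitle="Lucille – click here to go to my blog"> Lucille</a>: Hey, Tom. I’m new here. How do I respond to your comment?</div>
<div class="comment" id="446"><a href="/member.php?memberid="201" usertitle="Tim – click here to go to my blog"> Tim</a>: Congratulations, Lucille, you just did! Welcome to the community.</div>
</div></div></div>
HTML;

// One call per variable, using the surrounding code as delimiters.
$postids     = extract_between($page, 'postid="', '"');
$commentids  = extract_between($page, '<div class="comment" id="', '"');
$memberids   = extract_between($page, 'memberid="', '"');
$membernames = extract_between($page, 'usertitle="', ' – click here');

print_r($postids);
print_r($commentids);
print_r($memberids);
print_r($membernames);
```

On the sample page this collects one post id, three comment ids, and three member ids with their names, one per comment, so Tim’s id shows up twice. In real use, $page could be filled by PHP’s file_get_contents() instead of being typed in, and array_unique($memberids) would boil the list down to distinct members.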

Credit goes to Tizag for helping me to understand the “explode” command a bit better. In turn, if you can think of a way for me to explain this more clearly or fully, please let me know by sharing a comment.

Gun Violence Fact Check: the New Normal is Lower than the Old Normal

On Bill Moyers’ website, Cliff Schecter asks readers to “Say No to ‘The New Normal’ — Five Things You Can Do About Gun Violence.” This headline implicitly asserts that the level of gun violence being currently experienced is new — a “new normal.” The first paragraph of Schecter’s article implies that gun violence is not at a “new” low, but rather a “new” high:

“Some days it can seem like we should just give up. You’re just processing one senseless mass shooting in Las Vegas when you find out there has been a mass killing in Florida. But there’s no time to think about that because your television is saying that there’s a shooter on the loose in North Hollywood, and there has been another high school shooting in Oregon. It can lead to despair.”

Notice that Schecter’s claim about a rise in gun deaths to an unacceptable “new normal” rests on the way things “seem” to citizens based on reports in the news media. But perceived changes in reports in the news are not the same as observed changes in the event itself. To uncover the latter, let’s consult historical data from the FBI’s compilation of police reports in the 1996, 2000, 2004, 2008 and 2012 editions of its Crime in the United States publication. Although police reports are not a perfect measure of actual levels of crime, homicide is perhaps the one crime most transparent to police reports and the one most likely to have the mode of death recorded. The Crime in the United States publication brings us as close as possible to charting the actual trend in the number and type of homicides in the United States. Counts of murders involving firearms from 1992 to 2012 (the last year for which data is available) are presented below:

Number of Murder Victims Killed by Firearms.  Source: FBI, Crime in the United States reports

According to these records of police reports, there has been a decline rather than an increase in murder by gun in the United States. That decline in numbers is striking considering that during the same period the U.S. population rose. On January 1, 1992 the United States population was 254,782,555. On December 31, 2012 the United States population was 315,073,604. If the rate of gun deaths per person remained absolutely the same from 1992 to 2012, we would have seen an increase in the number of gun deaths of 23.7% simply because in 2012 there were 23.7% more people. Instead, the FBI’s Uniform Crime Reporting (UCR) figures show a drop of 42.8%, from 15,489 murders by firearm in 1992 to 8,855 murders by firearm in 2012.
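The arithmetic behind those percentages can be checked in a few lines of PHP. The sketch below uses only the four figures quoted above. The per-100,000 rate it prints is an extra step of my own, included because a rate accounts for population growth directly, and the variable names are illustrative.

```php
<?php
// Figures quoted in the text above.
$population_1992  = 254782555;
$population_2012  = 315073604;
$gun_murders_1992 = 15489;
$gun_murders_2012 = 8855;

// Percent change in population and in the count of firearm murders.
$population_change = ($population_2012 - $population_1992) / $population_1992 * 100;
$murder_change     = ($gun_murders_2012 - $gun_murders_1992) / $gun_murders_1992 * 100;

// Firearm murders per 100,000 residents in each year, and the change in that rate.
$rate_1992   = $gun_murders_1992 / $population_1992 * 100000;
$rate_2012   = $gun_murders_2012 / $population_2012 * 100000;
$rate_change = ($rate_2012 - $rate_1992) / $rate_1992 * 100;

printf("Population change: %+.1f%%\n", $population_change);
printf("Change in firearm murders: %+.1f%%\n", $murder_change);
printf("Rate per 100,000: %.2f in 1992, %.2f in 2012 (%+.1f%%)\n", $rate_1992, $rate_2012, $rate_change);
```

Because the count fell while the population grew, the per-person rate falls by more than the raw count does: by this calculation, by more than half.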

It could be reasonable to argue that 8,855 gun-related killings in the United States are still too many, and it could be reasonable to discuss what might be done to reduce the number of gun deaths even more. On the basis of nationally-collected data, however, it is not reasonable to assert that levels of gun violence are on the rise to some unacceptably high “new normal.”

School Shootings and the Thomas Theorem: Real in their Consequences

“If men define situations as being real, then they are real in their consequences.” (W.I. Thomas and Dorothy Swaine Thomas, 1928, The Child in America)

The Thomas Theorem is a staple of sociological insight, asserting that beliefs have a force of their own, whether those beliefs are factually accurate or not. If people believe something to be true, and act on the basis of that belief, then that belief can have observable consequences in the world, even when a belief is false.

As a case in point, consider our beliefs about the risk to children from school shootings. A regular Gallup Poll of the parents of school-aged children shows that a significant portion of Americans with school-aged children have consistently expressed fear for their children’s safety in school, and that the share of parents expressing such fears has increased in recent years:

Gallup Poll Question: Thinking about your oldest child, when he or she is at school, do you fear for his or her physical safety?

We are particularly worried about our children being shot and killed by aggressors packing guns. A typical Facebook discussion not only characterizes the current level of school shootings as “too many” but asks, “How many more innocent people are going to die before they actually do something worthwhile to stop this?” Tennessee resident Kali asks, “Are our children safe at school? School shootings are everywhere with no school exempt. Where or what security measures have been taken to ensure our child’s safety?”

In American culture, we believe that our children are at significant risk in school. But at the same time that parents’ concern runs high, the actual count of children killed in school shootings runs low and, if anything, has sunk even lower in recent years. As the latest report of the Bureau of Justice Statistics shows, the in-school homicide rate in 2010-2011 was 0.00002 percent, which means a child’s chance of being killed in school was 1 in 4.5 million.

In response to these kinds of calls, Hardwire Armor Systems LLC has been marketing bulletproof whiteboards to schools. Its 18×20 model is meant to be used as a standard handheld whiteboard, but features handles on the back and is designed to protect a teacher facing a school shooter from “multiple magazines of ammunition from handguns or shotguns without ricochet or injury.” The most basic 18×20 bulletproof whiteboard Hardwire offers costs $399, which is $383 more per whiteboard than a standard whiteboard of the same size.

Is this extra expenditure in every classroom worth it? Hardwire Armor Systems LLC says so, and offers glowing testimonials from students like 9th grader Tori Barros to explain why:

“Recently, there has been a lot of heartbreaking and frightening news about school safety. School is a place to think and be safe. The thought that someone can just walk into my classroom and take my life away is terrifying. My father says, ‘We can’t predict the future, all we can do is be best prepared for uncertain events.’ No one has a crystal ball that can tell us if a bad person is coming to our school. But knowing that Hardwire is providing my classmates and I with the protection that may someday save our lives is a huge relief.”

Some schools are buying.

In comparison to the 1 in 4.5 million risk that a child will be killed in school during a year, in any year 1 out of 60 thousand schoolchildren aged 5-14 die from heart attacks. If a 1 in 4.5 million risk is enough to place a bulletproof whiteboard in every classroom, is a 1 in 60 thousand risk enough to place a defibrillator in every classroom? Although defibrillators are being bought by schools, they’re being placed at the rate of just 1 or 2 per sprawling school, not within arm’s reach of a teacher. Why, when the risk of a student dying from heart disease is 75 times as great as the risk of a student dying from a school shooting? The answer is that we have collectively decided to pay more attention to some children’s deaths than others. Measured in terms of our focus on the issue in the media and measured in terms of our efforts to implement a solution, the death of a child from a shooting is more socially meaningful to us than the death of a child from a heart attack.

Even if we ignore other causes of child deaths and limit ourselves to child murders, we can still ask why school shootings attract as much attention as they do. After all, children are killed in much greater numbers in other contexts. Indeed, as the Bureau of Justice Statistics’ recent report shows, only 8 out of every 1,000 child murders occur in the context of school:

Where are children typically killed? The Centers for Disease Control and Prevention gathers information on child homicides from 17 states in the U.S. using its National Violent Death Reporting System, and the results can be browsed using the CDC’s Web-based Injury Statistics Query and Reporting System (also known as WISQARS). During the last year for which data is available (2011), the following are the places where children were murdered:

Place of Injury | Percentage
All | 100.00%
House, apartment, including driveway, porch, yard | 47.32%
Residential institution, including shelter, prison | <3%
Transport area: public highway, street or road | 27.13%
Transport area: other, including inside motor vehicle | 8.52%
Recreational area, cultural area or public building | <3%
Commercial area (non-recreational), including medical service area, farm, industrial or construction area | 3.47%
Natural area/countryside | 4.42%
Other specified place, including school, sports or athletics area | 3.47%
Unknown | 3.47%

 

If nearly half of all murders of children take place inside the home (many times more than take place in school), then shall we invest in bulletproof doors or suffocation-safe pillows? If more than a quarter of the murders of children occur on the street, then mightn’t we consider detachable bulletproof street signs as a security option?  Such solutions sound absurd to our ears, but these absurdities are an artifact of how we define the situation: if we were strictly interested in preventing the greatest number of child murders, we might want to consider in-home or on-street changes much more seriously than we consider in-school changes.  The fact that we devote our concern to security in schools despite the relative safety of schools tells us that our concern is a consequence of the way we construct social reality and the stories we tell about danger.  The stories we tell, based on our out-of-proportion beliefs, have real consequences: as teachers hoist bulletproof whiteboards in their classrooms, children die in their homes.
