Monday, April 13, 2009

Open Ed. Quest 5 -- Searching for a Better Way (to Search)

Quest 5


"Many BYU faculty already openly share their syllabi and other course materials on personal websites, through iTunesU, and through other mechanisms ... Find as many of the open educational resources being shared by BYU faculty as you can..."

It seems to me that discoverability is really going to be the ultimate make-or-break hinge issue for OER.  One could produce world class, high quality OER that trumps everything that any institutional OER effort produces, and yet remain in complete obscurity with no hope of ever actually sharing these wonderful OER with anyone at all.  And after all, if you take the time and trouble to make some kind of resource with openness in mind, it seems silly to have it be completely worthless (or at least, gravely underused) in the end because you weren't able to put it somewhere that people would find it.

This post isn't going to discuss the hows and whys of publishing open educational content for maximum discoverability. We'll save that for another time.  However, Quest 5 gives us the specific assignment to comb over BYU's web presence looking for faculty-produced OER content, and it raises the question, "How would one go about finding all of the OER on a university's web space?"

The task is not trivial.

What is "Open" ?


The first question that came to mind was how to define "open" as far as university faculty are concerned.  Many college and university professors post all sorts of wonderful educational content on their personal web pages, with the knowledge that outsiders will occasionally stumble onto the site and (hopefully) find some of their content useful.  I imagine that many people on the web would just be flattered if someone came to them and asked to use their content as material in a course.  ("Flattered" may not entirely be the right word if the visitor just ripped it off without permission and used it for commercial purposes, but let's leave that for now and just assume pure and benign motives here.)

The point is that, because of the publicly accessible nature of the web, putting a collection of materials on your website is done with the understanding that someone may eventually come along and make use of the things that you posted.  With that in mind, you go ahead and post everything you've got to your website, hoping that someone will like it and find it useful.  As far as you are concerned, putting it on your website makes it, in a way, "open."

And this is the first problem we run into while tracking down OER.  For some people, the simple precondition of an object's existence in an Internet-accessible location makes it essentially "open."  And unfortunately, in addition to being "open," it is also completely indistinguishable from all the other Internet-accessible NON-open content that is out there.

The Need for Creative Commons (or Something Similar)


So really, how will anyone find your materials in the first place?  The web is a big place.  Do you really think that your website is the only one with good instructions for underwater basket weaving?  Is yours the only one with autographed pictures of Royal Canadian Mounted Policemen?  Is yours the only one that describes different types of fly-fishing rods and reels?  Or cheese?  How will your website be differentiated from the thousands of other sites out there that are exactly like it?

One good place to start is Creative Commons.  That is, if you really intend for your content to be open, you should mark it with a specific license that meets your needs and level of openness.  In addition to letting everyone know exactly how your content may be used, posting a CC license on your page has an additional benefit:  it makes your page findable through Google Advanced Search.

Google Advanced Search seeks to provide a (partial) answer to the discoverability problem as it now allows you to search for pages that indicate specific usage rights.  From the Google website:

[Image: Google describing the usage rights advanced search feature]

So, Google will provide a list of web pages that contain your search query and carry the Creative Commons HTML code indicating the type of usage rights you are looking for.  Brilliant!  If I post my open content with some CC license code, then my OER is much more likely to be discovered by someone using Google who is looking specifically for CC-licensed content.
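
For the curious, the license code that the Creative Commons license chooser gives you is mostly an ordinary link carrying rel="license" and pointing at a creativecommons.org license URL. Here is a minimal Python sketch of how a crawler might spot that marker. I'm not claiming this is how Google does it; the class name, the function name, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class CCLicenseFinder(HTMLParser):
    """Collect hrefs of <a>/<link> tags marked rel="license" that point at Creative Commons."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel can hold several space-separated values, and may be missing entirely.
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href") or ""
        if "license" in rel_values and "creativecommons.org/licenses/" in href:
            self.licenses.append(href)


def find_cc_licenses(html):
    """Return every Creative Commons license URL marked up in the given HTML."""
    finder = CCLicenseFinder()
    finder.feed(html)
    return finder.licenses


# Made-up sample page, for illustration only.
sample = (
    '<p>My course notes.</p>'
    '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>'
    '<a href="http://example.com/">not a license</a>'
)
print(find_cc_licenses(sample))
```

A page with no such link comes back with an empty list, which is exactly the situation for content that is "open" only in its author's mind.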

Searching BYU through Google


Having discussed all of this, let's look again at our assignment to find all of the "open" educational resources on the BYU site.

First, a quick query from Google indicates that the byu.edu domain contains over 1 million webpages:

site:byu.edu
Results 1 - 10 of about 1,120,000 from byu.edu. (0.03 seconds)


Now, searching byu.edu for all CC-licensed material (the same query, this time with the usage rights option set in Advanced Search):

site:byu.edu
Results 1 - 100 of about 1,510 from byu.edu. (0.15 seconds)


Hmmm... so, of about 1.1 million web pages, Google tells me that only about 1,500 pages have some form of CC license HTML code on them.  And actually, if I follow through to the last result page, it says there are now "about 1,110" pages.  I tell it to "repeat search with omitted results included" and there are now "about 1,120" pages.  Following to the last page again, I find that, in fact, there are only 890 actual pages.

890 out of 1,120,000?  Who knows if the 1.1M estimate is correct? If it is correct, then 890 / 1,120,000 = 0.079% of BYU pages have a Creative Commons license on them.  Let's say that there are actually only half of that number of total pages, about 600,000 pages at byu.edu.  890 / 600,000 = 0.148%.

According to Google, about 1/10th of one percent of all byu.edu pages have Creative Commons HTML license code on them.
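
If you want to check my arithmetic, here is a tiny Python sketch of the same calculation. The helper name is mine, and both page totals are just the estimates discussed above, not verified counts.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


cc_pages = 890  # CC-licensed pages actually listed by Google

# Google's own estimate, then my more pessimistic guess at the real total.
for total_pages in (1_120_000, 600_000):
    print(f"{cc_pages} / {total_pages:,} = {percent(cc_pages, total_pages):.3f}%")
```

Either way the answer lands around a tenth of one percent.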

I don't believe that there aren't more people at BYU who want to openly share their content.  However, I do believe that these people don't know what a Creative Commons license is, or why they should use one.

Shell Scripting Fun


I created a Linux bash shell script to help me parse through the search results from Google.  This script takes a set of html pages (Google search results that I saved from Firefox to a folder on my hard drive) and searches through them for links to a specific domain.

So, for example, I search Google for all CC content pages at byu.edu.  I saved those results (100 at a time) to a folder on my hard drive.  I ran this script, and it identified each Google search result by looking for all links containing "http://something.whatever.byu.edu/".  The script then counts all search result links for a specific byu.edu subdomain.
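
My actual script was written in bash, and I'm not reproducing it here. The same idea can be sketched in a few lines of Python, assuming the saved result pages are plain .html files in one folder. The function names and the regular expression are illustrative, and a regex is only a rough stand-in for real HTML parsing.

```python
import re
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse

# Rough match for href="http://..." or href="https://..." attribute values.
HREF_RE = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)


def count_subdomains(html, domain="byu.edu"):
    """Count the links in one saved page, grouped by host under `domain`."""
    counts = Counter()
    for url in HREF_RE.findall(html):
        host = (urlparse(url).hostname or "").lower()
        # Keep byu.edu itself and anything ending in .byu.edu; skip everything else.
        if host == domain or host.endswith("." + domain):
            counts[f"http://{host}/"] += 1
    return counts


def count_saved_results(folder, domain="byu.edu"):
    """Add up the per-page counts for every .html file saved in `folder`."""
    total = Counter()
    for path in sorted(Path(folder).glob("*.html")):
        total += count_subdomains(path.read_text(errors="ignore"), domain)
    return total
```

Calling count_saved_results on the folder of saved pages and printing the result of .most_common() gives a count-per-subdomain listing. One caveat: a results page can link to the same hit more than once, so a real version would probably need to de-duplicate URLs before counting.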

Here's the breakdown of the 890 CC search results from Google by subdomain:

726 http://rhetoric.byu.edu/
73 http://open.byu.edu/
33 http://classes.eclab.byu.edu/
18 http://www.et.byu.edu/
16 http://morse.cs.byu.edu/
9 http://ilab.cs.byu.edu/
6 http://humanities.byu.edu/
2 http://csl.cs.byu.edu/
2 http://blogs.eclab.byu.edu/
1 http://www.math.byu.edu/
1 http://www.eclab.byu.edu/
1 http://synapse.cs.byu.edu/
1 http://reliability.ee.byu.edu/
1 http://ccl.ee.byu.edu/

rhetoric.byu.edu accounts for about 82% of all Google search results for CC-licensed content.

These 890 search results belong to 14 distinct subdomains.  In another Google search (for all content on byu.edu, not just CC pages) the first 1,000 results represent 497 distinct subdomains.  There are probably more.  So,

14 / 497 subdomains = about 3%, at most, of byu.edu subdomains have CC HTML license code somewhere on their site.

More difficult than anticipated


This information is fairly dismal for our prospects of magically finding all OER content for an institution through Google.  Google Advanced Search tells me that about 1/10th of 1 percent of web pages in the byu.edu domain have Creative Commons license code on them. Furthermore, one individual's site (rhetoric.byu.edu) accounts for about 82% of these pages.  If we consider this one person's site to be an outlier and temporarily discard it from our analysis, we end up with (according to Google numbers) 164 out of 1,120,000 pages with CC license code, or about 0.015% (15 thousandths of a percent).
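
Working the outlier numbers through in Python, using the per-subdomain counts from the breakdown above and Google's 1,120,000 page estimate (which, again, may well be off):

```python
# Counts copied from the subdomain breakdown earlier in the post.
counts = {
    "rhetoric.byu.edu": 726, "open.byu.edu": 73, "classes.eclab.byu.edu": 33,
    "www.et.byu.edu": 18, "morse.cs.byu.edu": 16, "ilab.cs.byu.edu": 9,
    "humanities.byu.edu": 6, "csl.cs.byu.edu": 2, "blogs.eclab.byu.edu": 2,
    "www.math.byu.edu": 1, "www.eclab.byu.edu": 1, "synapse.cs.byu.edu": 1,
    "reliability.ee.byu.edu": 1, "ccl.ee.byu.edu": 1,
}
total_pages = 1_120_000  # Google's estimate for all of byu.edu

total_cc = sum(counts.values())
rhetoric_share = 100 * counts["rhetoric.byu.edu"] / total_cc
remaining = total_cc - counts["rhetoric.byu.edu"]
remaining_pct = 100 * remaining / total_pages

print(total_cc)                   # 890
print(f"{rhetoric_share:.1f}%")   # 81.6%
print(remaining)                  # 164
print(f"{remaining_pct:.4f}%")    # 0.0146%
```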

Whatever the numbers are, the point is this:  There have got to be more people than this who want to share their content, but they probably just don't know anything about Creative Commons or other open licensing options.

What's next?


So, how can we find all the OER content at byu.edu?

Well, there's always the old-fashioned way.  As a class, we attempted to find all of this content by searching for "BYU" or "Brigham Young University" on sites that are known to have open content, such as iTunesU, YouTube, Flickr, etc.  It is also possible that you could do a site:byu.edu search on Google looking for keywords that you feel are more likely to lead you to open content (e.g. "resources", "links", "tutorials", etc.).
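
If you wanted to run through a list of such keywords without typing each search by hand, the query URLs are easy to generate. This is only a sketch: the function name is made up, and it does nothing more than put a site: query into the q parameter of a Google search URL.

```python
from urllib.parse import urlencode


def site_search_url(site, keyword):
    """Build a Google search URL restricted to `site` for one keyword."""
    query = f"site:{site} {keyword}"
    return "https://www.google.com/search?" + urlencode({"q": query})


for keyword in ("resources", "links", "tutorials"):
    print(site_search_url("byu.edu", keyword))
```

You would still have to read through the results yourself, of course, which is why this is the old-fashioned way.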

Or, maybe we could just do some CC evangelism in our respective communities and encourage people we know to start using CC licenses? Then we can find those resources with Google just fine.

And yet, I'm still intrigued by the idea of finding it all automagically.  But what would that require?  Probably some supervised machine learning to create a statistical model of what words or clusters of words are statistically significant when found in the vicinity of content that people typically consider to be "open."  That's not going to happen this semester.
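
Just to make the idea concrete, here is a toy version of what I have in mind: a tiny naive Bayes text classifier in plain Python. This is a sketch of one possible technique, not something I built or tested on real pages. The four training snippets are invented, and a real model would need thousands of hand-labeled pages.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the most likely label using multinomial naive Bayes with add-one smoothing."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior, then add one log-likelihood term per word.
        score = math.log(label_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training snippets, purely for illustration.
examples = [
    ("feel free to reuse these lecture notes and slides", "open"),
    ("tutorials and handouts shared under a creative commons license", "open"),
    ("all rights reserved do not copy or redistribute", "closed"),
    ("login required for enrolled students only", "closed"),
]
model = train(examples)
print(classify("free slides and tutorials to reuse", *model))  # open
```

The hard part is not the classifier. It is collecting enough labeled examples of pages that people actually consider "open."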

Then there's my script.  I had larger plans for it originally, but it too fell victim to end-of-semester time constraints.  It's not worthless, however.  I was able to use it to parse through Google search results and show me the breakdown of CC licensed content across byu.edu subdomains.  If nothing else, this helps me identify people and organizations that definitely do want to share some of their content openly.  If I can find these people, maybe they can point me to other people who also want to share their content openly, but don't yet know about open licenses.  Maybe we can start encouraging these people in groups of similar interests to adopt open licenses.  And this brings me to my next point, which is,

Networking


Some time ago, I remember David Wiley quoting someone at some conference somewhere (I'll get you a reference later) as asking, "What if I don't want to share my content with the world?  What if I just want to share it with the guy down the hall?"

This makes sense to me.  People that are close to you, either personally or professionally, are likely to have something in common with you.  At work, your coworkers are going to be working on the same (or similar) projects that you are working on.  At home, you will have friends and family members who share common goals and interests with you.  In my opinion, these are situations in which content sharing is much more likely.  You need some kind of common connection.

Enter social networking.  Can we leverage the power of social networks to augment the discoverability of OER?  Can we also use the power of social networks to classify and rate OER?  There has got to be some way to distribute this discoverability problem over the collective mental capacity of millions of people and their daily social interactions.

Or will we just keep relying on Google to help us stumble onto the content we're looking for?

2 comments:

  1. [...] Home « Open Ed. Quest 5 — Searching for a Better Way (to Search) [...]
  2. Good stuff. I like your approach (enhancing existing search functions) over the OER repository or registered index approach (seems like that's how web searching worked ... until Google), and hope to put some thought against the problem later today.