Naren Salem

Creating a search engine

21 April ’16

Why create one?

We are using CanvasLMS to host an online course that we have submitted to the Texas Education Agency (TEA) for review. Part of the review requirement is that the course has a built-in word search feature. But CanvasLMS (at least the hosted version) does not come with one.

For my first attempt, while we were still in development, I used Google Custom Search. I could create a custom search engine that would only return results from pages matching our course URL, and that search page could be embedded within the Canvas course using a third-party plugin. All looked OK.

But as we got closer to releasing our course, a couple of problems surfaced. The “search” part was fine; the problem was the indexing. We wanted our course to be behind a login to protect our material, which meant Google's indexer was not going to be able to reach our pages. I could have opened up the password protection for short periods (during the night) if I had known when Google was going to index our course.

Google gives you a little bit of control over on-demand indexing for websites that you own. Since we were hosting our course on the public Canvas site, we did not have that luxury.

Inspiration

I stumbled upon this blog that gave step-by-step instructions on creating an internet search engine. The fact that you could implement a basic search engine with just a little bit of code gave me the impetus to get started on my own. I was also hoping that I could simply clone that project and build on it.

I realized a couple of things about this project very soon. One: we don't need to re-invent the wheel. That blog post was about learning the details of how a basic search engine works, and that was great. But for my task, I would be better off using off-the-shelf modules that already did indexing, searching and ranking very well, rather than re-invent a very rudimentary wheel.

Another option

So I looked into Solr. It seemed like I could get a search engine set up within 5 minutes and use that as the basis of my implementation. The problem with this approach was that Solr (and Lucene underneath it) is a huge, full-featured search solution with a fairly high setup cost in terms of my time. I wanted to implement something in a couple of days.

Enter PostgreSQL

At this point I had a much better definition of what I wanted to create, and the design fell into place.

  1. A crawler that would pull down the roughly 500 pages that needed to be indexed
  2. A database to store the relevant sections of the web pages indexed by the page URL
  3. The search code to find the pages that matched the search terms
  4. A front end to get user input and display results.

And along with this came an a-ha moment. A quick check confirmed that PostgreSQL's built-in full-text search functions could give me a pretty good search system for step 3.
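
In the end, step 3 boils down to little more than a single query. Here is a minimal sketch of what that query can look like through the pg gem, assuming a pages table with url, title and body columns (the names, like the example search term, are just for illustration):

require 'pg'

conn = PG.connect(ENV['DATABASE_URL'])

# Rank pages by how well their title and body match the search terms.
# plainto_tsquery safely turns free-form user input into a tsquery.
def search(conn, query)
  conn.exec_params(<<-SQL, [query])
    SELECT url, title,
           ts_rank(to_tsvector('english', title || ' ' || body),
                   plainto_tsquery('english', $1)) AS rank
    FROM pages
    WHERE to_tsvector('english', title || ' ' || body)
          @@ plainto_tsquery('english', $1)
    ORDER BY rank DESC
    LIMIT 20
  SQL
end

search(conn, 'photosynthesis').each do |row|
  puts "#{row['rank']}  #{row['title']}  #{row['url']}"
end

For a corpus of around 500 pages, computing to_tsvector on the fly like this is fast enough; a stored tsvector column with a GIN index would be the natural next step if it weren't.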

I also made sure that I would be able to get this working on Heroku, where I intended to host my app.

Learning to crawl

The problem with the pages I wanted to index was that they didn't want to be indexed. The content was loaded with JavaScript, so fetching the pages with basic HTTP libraries did not return the content. Since I wanted a quick implementation and only had a few hundred pages, I figured I could automate a browser to open the pages and then save them to disk.

The best option was this script that I found. It was fully functional as it was, and it worked like a charm.
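
The idea is simple enough to sketch: drive a real browser so the JavaScript gets a chance to run, wait for the content to appear, then dump the rendered HTML to disk. Here is a minimal sketch along those lines using the selenium-webdriver gem (not the script linked above; the input file and output directory are placeholders):

require 'selenium-webdriver'

# Placeholder input: one course page URL per line.
urls = File.readlines('urls.txt').map(&:strip)

driver = Selenium::WebDriver.for :chrome
wait = Selenium::WebDriver::Wait.new(timeout: 15)

# The course is behind a login, so open the first page and log in
# manually in the browser window before letting the loop run.
driver.navigate.to urls.first
puts "Log in to Canvas in the browser window, then press Enter..."
gets

urls.each do |url|
  driver.navigate.to url
  # Wait for the JavaScript-rendered content to show up.
  wait.until { driver.find_element(css: 'div.user_content') }
  File.write('./data/' + url.split('/').last + '.html', driver.page_source)
end

driver.quit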

“Indexing”

Once I had the files downloaded, it was easy enough to get the data uploaded to PostgreSQL. I only needed to save the page title and the contents of the div.user_content element.

require 'hpricot'
require 'pg'

# Assumes @conn is an open PG connection with an 'insert_page'
# prepared statement already set up (see below).
def add_to_index(url)
  File.open('./data/' + url.split('/').last + '.html', "r") do |doc|
    h = Hpricot(doc)

    # Keep only the page title and the div.user_content element.
    title, body = h.search('title').text.strip, h.search('div.user_content')

    # Drop the tags whose contents we don't want in the index.
    %w(style noscript script form img).each { |tag| body.search(tag).remove }

    # Collect every remaining text node on the page.
    array = []
    body.first.traverse_element do |element|
      if element.text?
        text = element.to_s.strip
        array << text if text != ""
      end
    end

    print "Adding page #{url} #{title}\n"
    @conn.exec_prepared('insert_page', [url, title, array.join(" ")])
  end
rescue => e
  print "Failed to process file #{url}: #{e.message}\n"
end
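
The add_to_index method above relies on a pages table and an 'insert_page' prepared statement being set up beforehand. A minimal sketch of that setup, with table and column names chosen for illustration:

require 'pg'

# On Heroku, DATABASE_URL points at the provisioned PostgreSQL instance.
@conn = PG.connect(ENV['DATABASE_URL'])

# One row per crawled page, keyed by its URL.
@conn.exec(<<-SQL)
  CREATE TABLE IF NOT EXISTS pages (
    url   text PRIMARY KEY,
    title text,
    body  text
  )
SQL

@conn.prepare('insert_page',
  'INSERT INTO pages (url, title, body) VALUES ($1, $2, $3)')

With the title and body stored as plain text, the full-text search query shown earlier works directly against this table.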