Naren Salem

Creating a search engine

21 April ’16

Why create one?

We are using CanvasLMS to host an online course that we have submitted to the Texas Education Agency (TEA) for review. Part of the review requirement is that the course has a built-in word search feature. But CanvasLMS (at least the hosted version) does not come with a search feature.

For my first attempt I used Google Custom Search, while we were still in development. I could create a custom search engine that would only return results from pages matching our course URL, and the search page could be embedded within the Canvas course using a third-party plugin. All looked OK.

But as we got closer to releasing our course, a couple of problems surfaced. The “search” part of the search was fine; the problem was with the indexing part. We wanted our course to be behind a login to protect our material. This meant that the Google indexer wasn’t going to be able to find our pages. I could have opened up the password protection for short periods of time (during the night) if I knew when Google was going to index our course.

Google gives you a little bit of control over on-demand indexing for websites that you own. Since we were hosting our course on the public Canvas site, we did not have that luxury.


I stumbled upon this blog that went through step-by-step instructions for creating an internet search engine. The fact that you could implement a basic search engine with just a little bit of code gave me the impetus to get started with creating my own. I was also hoping that I could just clone this project and get started with that.

I realized a couple of things about this project very soon. One - we don’t need to re-invent the wheel. This blog post was about learning the details of how a basic search engine would work. That was great. But for my task, I would be better off using off the shelf modules that already did things like indexing, finding and ranking very well, rather than re-invent a very rudimentary wheel.

Another option

So I looked into Solr. At first glance it seemed like I could get a search engine set up within 5 minutes and use that as the basis of my implementation. The problem with this approach was that Solr (and Lucene) are huge, full-featured search solutions with a fairly high setup cost in terms of my time. I wanted to implement something in a couple of days.

Enter PostgreSQL

At this point I had a much better definition of what I wanted to create, and the design fell into place.

  1. A crawler that would pull down the roughly 500 pages that I needed to be indexed
  2. A database to store the relevant sections of the web pages indexed by the page URL
  3. The search code to find the pages that matched the search terms
  4. Front end to get user input and display results.

And along with this came an a-ha moment. A quick check confirmed that PostgreSQL's full text search had built-in functions that would make a pretty good search system for my step 3.
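As a rough sketch of what step 3 can look like with those built-in functions (not necessarily the exact query used in this project): the table name pages, its columns url, title and body, the 'english' text search configuration and the helper name search_pages are all assumptions made for illustration. to_tsvector, plainto_tsquery, ts_rank and the @@ match operator are standard PostgreSQL full text search features.

```ruby
# Assumed (hypothetical) table: pages(url text, title text, body text).
# The query matches the search terms against title + body and ranks the hits.
SEARCH_SQL = <<~'SQL'
  SELECT url, title,
         ts_rank(to_tsvector('english', title || ' ' || body),
                 plainto_tsquery('english', $1)) AS rank
  FROM pages
  WHERE to_tsvector('english', title || ' ' || body)
        @@ plainto_tsquery('english', $1)
  ORDER BY rank DESC
  LIMIT $2
SQL

# Runs the search through any object that responds to exec_params
# (for example a PG::Connection) and returns the rows as an array.
def search_pages(conn, terms, limit = 10)
  conn.exec_params(SEARCH_SQL, [terms, limit]).to_a
end
```

With the pg gem, conn would be something like PG.connect(ENV['DATABASE_URL']), and search_pages(conn, 'cell division') would return the best matching pages first. Computing to_tsvector on every query is fine for a few hundred pages; a stored tsvector column with an index would be the next step if it ever got slow.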

I also made sure that I would be able to get this working on Heroku where I intended to host my app.

Learning to crawl

The problem with the pages I wanted to index was that they didn’t want to be indexed. The content was being loaded with JavaScript, so fetching a page with a basic HTTP library did not return the content. Since I wanted a quick implementation and I only had a few web pages, I figured I could automate the browser to open each page and then save it to disk.

The best option was this script that I found. It was fully functional as it was, and it worked like a charm.
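This is not the script mentioned above, just a minimal sketch of the same idea in Ruby. It assumes driver is a browser automation object such as a Selenium::WebDriver instance (from the selenium-webdriver gem), that the course login has already been handled in that browser session, and that the page is ready once a div.user_content element exists. The helper names local_path_for and save_rendered_page are made up for illustration; the file naming matches what add_to_index expects.

```ruby
require 'fileutils'

# Maps a page URL to its local file, e.g. ".../pages/intro" -> "./data/intro.html".
def local_path_for(url, dir = './data')
  File.join(dir, url.split('/').last + '.html')
end

# Opens the page in the automated browser, waits until the JavaScript-loaded
# content is present, then saves the rendered HTML to disk.
def save_rendered_page(driver, url, dir = './data', timeout = 15)
  driver.get(url)
  deadline = Time.now + timeout
  until driver.find_elements(css: 'div.user_content').any?
    raise "Timed out waiting for content on #{url}" if Time.now > deadline
    sleep 0.5
  end
  FileUtils.mkdir_p(dir)
  path = local_path_for(url, dir)
  File.write(path, driver.page_source)
  path
end

# Possible usage (requires the selenium-webdriver gem and a local browser):
#   require 'selenium-webdriver'
#   driver = Selenium::WebDriver.for(:firefox)
#   urls.each { |url| save_rendered_page(driver, url) }
#   driver.quit
```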


Once I had the files downloaded, it was easy enough to get the data uploaded to PostgreSQL. I only needed to save the page title and the div.user_content element contents.

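require 'hpricot'
require 'pg'

# Parse a saved page from ./data and insert its title and visible text
# into PostgreSQL. @conn is an open PG connection with 'insert_page' prepared.
def add_to_index(url)
  File.open('./data/' + url.split('/').last + '.html', 'r') do |doc|
    h = Hpricot(doc)
    title, body = h.search('title').text.strip, h.search('div.user_content')
    %w(style noscript script form img).each { |tag| body.search(tag).remove }

    # Collect the non-empty text nodes inside div.user_content.
    array = []
    body.first.traverse_element do |element|
      if element.text?
        text = element.to_s.strip
        array << text unless text.empty?
      end
    end

    print "Adding page #{url} #{title}\n"
    @conn.exec_prepared('insert_page', [url, title, array.join(' ')])
  end
rescue StandardError
  # A missing file or a page without div.user_content ends up here.
  print "Failed to process file #{url}\n"
end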
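The function above relies on a prepared statement named insert_page, whose definition is not shown. Here is one way it might be set up. The pages table, its columns and the helper name prepare_index are assumptions for illustration; conn is expected to be a PG::Connection (or anything that responds to exec and prepare).

```ruby
# Hypothetical schema: one row per crawled page, keyed by its URL.
CREATE_PAGES_SQL = <<~'SQL'
  CREATE TABLE IF NOT EXISTS pages (
    url   text PRIMARY KEY,
    title text NOT NULL,
    body  text NOT NULL
  )
SQL

# Parameter order matches the array passed to exec_prepared: url, title, body.
INSERT_PAGE_SQL = 'INSERT INTO pages (url, title, body) VALUES ($1, $2, $3)'

# Creates the table if needed and registers the 'insert_page' statement.
def prepare_index(conn)
  conn.exec(CREATE_PAGES_SQL)
  conn.prepare('insert_page', INSERT_PAGE_SQL)
end
```

Calling prepare_index(@conn) once after connecting would be enough; after that, add_to_index can be called for each downloaded page.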

Game Engines for the Web

13 March ’16

My son likes to play some of the casual games on Facebook. I am not a big fan of the ads he ends up watching to play the game. I don’t know if that bothers me more or less than the fact that these games are made using Flash.

So I figured, what if I could make some of these simple games myself, without ads and a lot less gaudy?

I started my search for game engines here. This was such a well written article I was ready to believe whatever they were going to say… but I also looked at this comparison tool.


Best thing about Crafty seems to be its component oriented design. The Build New Games link above describes it well. This might actually be a good place for me to start. Need to compare this with CreateJS.


From the Build New Games review

Lime is a very “safe” choice in that it will for sure be able to handle just about any 2D game you could throw at it. But Lime is also a bit tedious at times, and also sometimes very general. Lime can be seen more as a “game framework” from which you could build specific game engines on top of. One area that Lime really excels is its node graph. Just about everything in Lime derives from the base Node class, and nodes can contain child nodes. This allows you to build complicated entities that can have many children, yet address them from a high level. For example you might add a character node as a child of a vehicle node it is riding in. The character node might also have an armor node that he is wearing. Whenever the parent vehicle node moves, the character and his armor will move right along with it, making them act as a cohesive whole.


Another good looking open source engine, with physics support. Check out this tutorial for creating a space invaders clone. Melon also seems to have more recent activity than some of the others.


Mainly suited for 2D Platformers and action games. Plus it is not free. So maybe this is not going to be the first engine I try.


It seems like these libraries give you drawing tools, but the interactions have to be provided by the developer. This library may be great for animated websites, I think!

Wild Kratts games with CreateJS - http://pbskids.org/wildkratts/games/

With AngularJS

Not much on details here (which isn’t a bad thing), but a good inspiration for connecting up a game engine with Angular.


Not gonna work out


Facebook Comment Plugin in Rails

13 March ’16

For the video website I am creating we wanted to add some commenting options. The Facebook Comments plugin works nicely for our needs. It is easy to add. Looks half decent. And the visitors don’t need to create yet another account (if they have Facebook).

But I was running into a problem: when the page loaded, the comment plugin would not show until I refreshed the page. It turns out there may be a problem with how Turbolinks interacts with JavaScript. I found the solution on StackOverflow, but that didn’t have much detail. I will probably run into something else that will provide more insight soon.

Read this on StackOverflow here.

Website Accessibility

10 March ’16

To submit an online course for review with the TEA, we need to comply with a host of requirements. One of them, understandably, relates to accessibility. Their requirement is that we satisfy these two guidelines:

  • W3C Web Content Accessibility Guidelines (WCAG) 2.0 Level AA - https://www.w3.org/WAI/Resources/
  • Federal Law - Section 508 - http://www.section508.gov/

Most of the guidelines and requirements are pretty common sense. Make the site readable, make sure it will work well with screen readers, make sure it won’t cause seizures. That kind of stuff.

I ran our site through a couple of online tools. The cleanest one was http://www.cynthiasays.com/. The good-ish news is that, in terms of our content, we were flagged for two things:

  • Italics
  • Missing alt text for images

That’s not too bad. Mainly because we can easily fix these. I am not sure if the TEA will use some tool for their testing or if it will be up to reviewers.

Unfortunately we got pulled up for some other stuff as well. The testing tools flag some things with the Canvas framework. These are for some of their menu items and hidden elements. These flags range in severity with one or two showing up as ‘must fix’.

Canvas has a forum post that brags about how their site is designed for accessibility. I wonder what they have to say about the things that were detected by the online tools.

To explore this further, I ran the tool on the following government websites:

  • section508.gov (the accessibility standards website)
  • whitehouse.gov
  • irs.gov
  • iCEVonline.com

And they all failed with similar issues to the ones on Canvas.

If our course is approved, we are required to get an independent body to certify the accessibility of our course website. This is due on May 5, 2017.

Search in Canvas LMS

9 March ’16

Frustratingly, CanvasLMS does not provide a way to search within a course. Who doesn’t these days? I found a (short) thread on their forum which did provide a simple solution: Google Custom Search.

Here is the discussion on the CanvasLMS community board.