@NickyBobby
Last active June 12, 2016 15:11

Feedjira VS Nokogiri

I was recently put on a project for an app that scrapes job sites, looking for jobs that use particular technologies. To 'scrape' means to search a document (or website) for particular matches and extract that information for your own purposes. For example, if you were searching for jobs that use Ruby, you could go through every job posting and check whether the word Ruby appears in it. If it does, you can grab the entire job description and parse it further for more information. This is exactly what we are doing in this application.
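As a rough sketch of that "check each job for the word Ruby" step, here is how it might look in plain Ruby. The `Job` struct, the sample postings, and the `jobs_mentioning` helper are all made up for illustration; they are not from the actual app.

```ruby
# Hypothetical job records; in a real app these would come from a scraped feed.
Job = Struct.new(:title, :description)

JOBS = [
  Job.new("Backend Developer", "We build APIs with Ruby and Rails."),
  Job.new("Frontend Developer", "React and TypeScript experience required."),
  Job.new("Data Engineer", "Experience with Rubymine is a plus.")
].freeze

# Returns the jobs whose description mentions the technology as a whole word.
# The \b word boundaries keep "Ruby" from matching inside "Rubymine".
# Note: \b only works well for names made of word characters (not "C++").
def jobs_mentioning(jobs, technology)
  pattern = /\b#{Regexp.escape(technology)}\b/i
  jobs.select { |job| job.description.match?(pattern) }
end

puts jobs_mentioning(JOBS, "Ruby").map(&:title)
# prints: Backend Developer
```

Once a job matches, its full description can be handed off to whatever parsing comes next.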

When I stepped into this code base, we were using Feedjira to parse the information for each job, and there were problems with grabbing locations. Feedjira could not target the location tag directly, so the location had to be pulled out of the job title using regex. Regex is a super powerful tool for finding pattern matches in strings, but it was not being used correctly in this case. So I switched the parser over from Feedjira to Nokogiri, which enabled me to target the location tag directly and grab it without using regex. The rest of this post explains the differences between Feedjira and Nokogiri, and the pros and cons of each.

What is Feedjira?

Feedjira (formerly Feedzirra) is a Ruby gem for fetching and parsing RSS feeds. One of its benefits is that it turns the parsed feed into Ruby objects with attributes, which makes the data super easy to use: you just call the attribute you want. One of the cons is that, by default, it doesn't recognize every tag and turn it into an attribute. That was the problem on the project I was in. There was no easy way to grab the location using the Feedjira gem, so the best option was to check the job title and see if the location was inside parentheses using regex.
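To give a feel for the regex approach, here is a minimal sketch in plain Ruby. It assumes titles look like "Ruby Developer (Denver, CO)", with the location in parentheses at the very end; the pattern and the `location_from_title` helper are my own illustration, not the code from the project.

```ruby
# Matches a parenthesized group at the very end of the title,
# e.g. the "(Denver, CO)" in "Ruby Developer (Denver, CO)".
LOCATION_IN_TITLE = /\(([^()]+)\)\s*\z/

# Returns the text inside the trailing parentheses, or nil if there is none.
def location_from_title(title)
  match = LOCATION_IN_TITLE.match(title)
  match && match[1].strip
end

p location_from_title("Ruby Developer (Denver, CO)") # => "Denver, CO"
p location_from_title("Ruby Developer")              # => nil
p location_from_title("Ruby Developer (Contract)")   # => "Contract"
```

The last call shows why this is fragile: anything in trailing parentheses gets treated as a location, whether it is one or not.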

What is Nokogiri?

Nokogiri is an HTML, XML, SAX, and Reader parser. Its main feature is the ability to search documents via XPath or CSS3 selectors. This especially came in handy when I was trying to target the location tag for jobs. The main pro of this parser is its flexibility: for example, you can use XPath and CSS3 selectors side by side to pull different pieces of data out of the same document. The only con I can find is that it forces you to be more explicit with your queries. For example:

In Nokogiri you can type title = entry.css('title').text to return "Some Title".

Whereas in Feedjira you can type title = entry.title to return "Some Title".

You have to be explicit when targeting information. This is also why it's so versatile: it lets you do more things with one gem.

Which is better?

In conclusion, it ultimately depends on what you're doing and what you're trying to accomplish. If you're scraping a simple RSS feed and only need a few pieces of data, then Feedjira might be the better solution. For almost every other type of scraping, I would say that Nokogiri is the better gem to use.
