Parsing HTML, XML, JSON

Generally speaking the best and easiest way for parsing HTML and XML is using Nokogiri library

  • To install Nokogiri

    gem install nokogiri

HTML

Here we'll use nokogiri to list our contents list from http://rubyfu.net/content/

Using CSS selectors

require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open("http://rubyfu.net/content/"))
page.css(".book .book-summary ul.summary li a, .book .book-summary ul.summary li span").each { |css| puts css.text.strip.squeeze.gsub("\n", '')}

Returns

RubyFu
Module 0x0 | Introduction
0.1.  Contribution
0.2.  Beginners
0.3.  Required Gems
1.  Module 0x1 | Basic Ruby Kung Fu
1.1.  String
1.1.1.  Conversion
1.1.2.  Extraction
1.2.  Array
2.  Module 0x2 | System Kung Fu
2.1.  Command Execution
2.2.  File manipulation
2.2.1.  Parsing HTML, XML, JSON
2.3.  Cryptography
2.4.  Remote Shell
2.4.1.  Ncat.rb
2.5.  VirusTotal
3.  Module 0x3 | Network Kung Fu
3.1.  Ruby Socket
3.2.  FTP
3.3.  SSH
3.4.  Email
3.4.1.  SMTP Enumeration
3.5.  Network Scanning
.
.
..snippet..

XML

There are 2 ways we'd like to show here, the standard library rexml and nokogiri external library

We've the following XML file

<?xml version="1.0"?>
<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A scientific fiction</description>
</movie>
   <movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

REXML

require 'rexml/document'
include REXML

file = File.read "file.xml"
xmldoc = Document.new(xmlfile)

# Get the root element
root = xmldoc.root
puts "Root element : " + root.attributes["shelf"]


# List of movie titles.
xmldoc.elements.each("collection/movie") do |e|
  puts "Movie Title : " + e.attributes["title"] 
end

# List of movie types.
xmldoc.elements.each("collection/movie/type") do |e|
  puts "Movie Type : " + e.text 
end

# List of movie description.
xmldoc.elements.each("collection/movie/description") do |e|
  puts "Movie Description : " + e.text
end

# List of movie stars
xmldoc.elements.each("collection/movie/stars") do |e|
  puts "Movie Stars : " + e.text
end

Nokogiri

require 'nokogiri'

Slop

require 'nokogiri'
# Parse XML file
doc = Nokogiri::Slop file

puts doc.search("type").map {|f| t.text}        # List of Types
puts doc.search("format").map {|f| f.text}      # List of Formats
puts doc.search("year").map {|y| y.text}        # List of Year
puts doc.search("rating").map {|r| r.text}      # List of Rating
puts doc.search("stars").map {|s| s.text}       # List of Stars
doc.search("description").map {|d| d.text}      # List of Descriptions

JSON

Assume you have a small vulnerability database in a json file like follows

{
  "Vulnerability": 
  [
    {
      "name": "SQLi",
      "details:": 
        {
          "full_name": "SQL injection",
          "description": "An injection attack wherein an attacker can execute malicious SQL statements",
          "references": [
            "https://www.owasp.org/index.php/SQL_Injection", 
            "https://cwe.mitre.org/data/definitions/89.html"
            ],
          "type": "web"
        }
    }
  ]
}

To parse it

require 'json'
vuln_json = JSON.parse(File.read('vulnerabilities.json'))

Returns a hash

{"Vulnerability"=>`
  [{"name"=>"SQLi",
    "details:"=>
     {"full_name"=>"SQL injection",
      "description"=>"An injection attack wherein an attacker can execute malicious SQL statements",
      "references"=>["https://www.owasp.org/index.php/SQL_Injection", "https://cwe.mitre.org/data/definitions/89.html"],
      "type"=>"web"}}]}

Now you can retrieve and data as you do with hash

vuln_json["Vulnerability"].each {|vuln| puts vuln['name']}

If you want to add to this database, just create a hash with the same struction.

xss = {"name"=>"XSS", "details:"=>{"full_name"=>"Corss Site Scripting", "description"=>" is a type of computer security vulnerability typically found in web applications", "references"=>["https://www.owasp.org/index.php/Cross-site_Scripting_(XSS)", "https://cwe.mitre.org/data/definitions/79.html"], "type"=>"web"}}

You can convert it to json just by using `.to_json` method

xss.to_json

Last updated