Snip - extract a named element from an html file using bash

This is a primitive way of getting the kind of data extraction more commonly associated with true XML, applied to any reasonably modern html file (i.e. one that is well-formed and makes proper use of the id attribute). The purpose is mainly simple, fast text browsing, which is especially useful for quick look-ups in dictionaries, thesauri, encyclopedias and the like. Since the data you're interested in usually sits in one specific element, text browsing is greatly improved by extracting that element and discarding the rest. You run the script by specifying an element in the standard css way (element#id) and the file to be parsed; the script then pipes that element (and only that element) through html2text, which does a really nice job of turning html code into legible console text.

EDIT: Added a quick check for the presence of the element type in each line (before the grep operations), which greatly increases speed with large elements like #content on Wikipedia.

#! /bin/bash

printhelp () {
echo "snip is a simple bash html cutter that works by extracting a specific element
from an html file and feeding it to html2text. It presupposes well-formed html
and that you know the kind of element you want and its id.

Syntax:
snip <element type>#<element id> <file to be parsed>

Example:
snip div#bodyContent /tmp/index.html
"
exit
}

quitter () {
# Report the failure on stderr and signal it through the exit status
echo "Element id not found. Quitting." >&2; exit 1
}

[[ "$1" = "-h" || "$1" = "--help" || -z "$1" ]] && printhelp

# Split "type#id" with parameter expansion; quoting keeps odd input intact
elementtype="${1%%#*}"
id="${1#*#}"
htmlfile="$2"
# Line number of the first line that carries the id
thebegin=$(grep -nioE "id=\"$id\"" "$htmlfile" | head -n 1 | cut -d ':' -f 1)
[ -n "$thebegin" ] || quitter

# Copy the file from the id line onwards, dropping whatever precedes the id on that line
sed -n "${thebegin}p" "$htmlfile" | sed -re "s/^.*id=\"$id\"/<$elementtype id=\"$id\"/" > /tmp/snipfile
sed -n "$((thebegin+1)),\$p" "$htmlfile" >> /tmp/snipfile

i=0
element=0
# Count opening and closing tags line by line; the element ends where the count returns to zero
while IFS= read -r line; do
	i=$((i+1))
	if [[ "$line" =~ "$elementtype" ]]; then
		elementbegincount="$(printf '%s\n' "$line" | grep -io "<$elementtype" | grep -c .)"
		elementendcount="$(printf '%s\n' "$line" | grep -io "</$elementtype" | grep -c .)"
		element=$((element+elementbegincount-elementendcount))
		if [ "$element" -le 0 ]; then
			sed -n "1,${i}p" /tmp/snipfile | html2text
			exit
		fi
	fi
done < /tmp/snipfile
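
The tag-counting idea in the loop above is easy to lose among the grep and sed calls, so here is a minimal sketch of the same technique in Ruby rather than bash. The method name `snip_element` and the sample markup are made up for illustration, and unlike the script it matches the element type on a word boundary, so treat it as an approximation rather than a port.

```ruby
# Hypothetical helper: returns the lines of `html` that make up the element
# <type id="id">, found by counting opening and closing tags of that type.
# Like the bash script, it works on whole lines and assumes well-formed html.
def snip_element(html, type, id)
  lines = html.lines
  id_attr = /id="#{Regexp.escape(id)}"/i
  start = lines.index { |l| l =~ id_attr }
  return nil unless start

  # Keep everything from the id line onwards and drop whatever precedes
  # the id on that line, as the first sed step of the script does.
  body = lines[start..-1]
  body[0] = body[0].sub(/\A.*?#{id_attr}/) { %(<#{type} id="#{id}") }

  open_tag  = /<#{Regexp.escape(type)}\b/i
  close_tag = /<\/#{Regexp.escape(type)}\b/i
  depth = 0
  body.each_with_index do |line, offset|
    depth += line.scan(open_tag).size - line.scan(close_tag).size
    # Depth back at zero: the element closed somewhere on this line.
    return body[0, offset + 1].join if depth <= 0
  end
  nil # the element never closed, so the input was not well-formed
end

# Made-up sample page with a nested div inside the target element.
sample = <<~HTML
  <html><body>
  <div id="content">
  <div>nested</div>
  </div>
  <p>footer</p>
  </body></html>
HTML

puts snip_element(sample, 'div', 'content')
```

Because the counting is per line, anything that follows the closing tag on the same line is returned along with the element, which is also how the script behaves.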


As an example of how the script can be put to use, here's my Wikipedia lookup (the script above is referred to as 'snip' here):

#! /bin/bash

useragent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071619 Firefox/3.0.1"
if wget -q -U "$useragent" -O /tmp/wpfile "http://en.wikipedia.org/wiki/Special:Search?search=$*"; then
	clear
	echo "Page downloaded..."
	snip div#content /tmp/wpfile | less
else
	echo "No connection, sorry. Please try again."
fi

Serve php within .htm

Add this line to your .htaccess file (optionally only in a specific folder) so that .htm files are parsed as PHP. This works on TxD accounts.
AddType application/x-httpd-php .htm .php

Example using xmlparser's saxdriver to parse huge files

#!/usr/bin/env ruby
## to run this you call run_amazon_import(datafile) with datafile = the name of a file to parse, which is later opened as:
## ("#{RAILS_ROOT}/data/" + datafile + ".xml")
## This is hard coded to look at Item elements, and in this example
## parses out the ASIN as @@product_id and ItemAttributes/Title as @@name
## see check_position_space(name,ch)

require 'xml/saxdriver'
@flag_item = false

@@finaldata = []
@@vars = []
@@positionSpace = []
@@currentName = []
def reset_vals
  @@product_id = nil
  @@name = nil
end
def check_position_space(name,ch)
  # for each value inside an Item we check whether
  # @@positionSpace (a concatenation of the enclosing element names)
  # equals the path we are looking for; if so, store the value in a
  # class variable
  if @@positionSpace == 'ASIN'
    @@product_id =  ch
  elsif @@positionSpace == 'ItemAttributesTitle'
  	# if I did this again, I would build @@positionSpace
  	# with / between names in startElement so it would be similar to other
  	# ruby xml naming schemes, so:
  	# @@positionSpace == 'ItemAttributesTitle' would be:
  	# @@positionSpace == 'ItemAttributes/Title'
  	@@name = ch
  end
end

class TestHandler < XML::SAX::HandlerBase
  attr_accessor :data
  def startDocument
    @@data = []
  end
  def startElement(name, attr)
    @flag_item = true if name == 'Item'
    @@positionSpace = '' if name == 'Item'
    if @flag_item == true and name != 'Item'
        @@positionSpace = @@positionSpace + name
    elsif name == 'Item'
      reset_vals
    end
    @@currentName = name
  end
  def endElement(name)
    if @flag_item == true and name != 'Item'
        lenName = name.length
        @@positionSpace = @@positionSpace[0, @@positionSpace.length - lenName]
    end
    if name == 'Item'
      @@finaldata  << @@data.to_s
      @@data = []
	  ## Here I would have a fully parsed Item and do something with it
    end
    @flag_item = false if name == 'Item'
  end
  def characters(ch, start, length)
    check_position_space(@@currentName, ch[start, length])
  end
end

def run_amazon_import(datafile)
  @@datafile = datafile
  p = XML::SAX::Helpers::ParserFactory.makeParser("XML::Parser::SAXDriver")
  h = TestHandler.new
  p.setDocumentHandler(h)
  p.setDTDHandler(h)
  p.setEntityResolver(h)
  p.setErrorHandler(h)

  begin
    p.parse("#{RAILS_ROOT}/data/" + datafile + ".xml")
  rescue XML::SAX::SAXParseException
    p(["ParseError", $!.getSystemId, $!.getLineNumber, $!.getMessage])
  end
end
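
The comment in check_position_space suggests joining element names with / instead of concatenating them. A minimal sketch of that variant follows, assuming a stack of element names in place of the string slicing done in endElement. The class `ItemPathHandler`, its snake_case method names and the sample values are all hypothetical, and the handler is driven by hand here instead of being registered with the SAX driver.

```ruby
# Hypothetical handler: tracks the position inside <Item> as a stack of
# element names and compares the '/'-joined path against the wanted fields.
class ItemPathHandler
  attr_reader :items

  def initialize
    @items = []     # one hash per completed <Item>
    @path = []      # names of the elements currently open below <Item>
    @current = nil  # hash being filled, or nil while outside an <Item>
  end

  def start_element(name)
    if name == 'Item'
      @current = {}
      @path.clear
    elsif @current
      @path.push(name)
    end
  end

  def characters(text)
    return unless @current
    case @path.join('/')
    when 'ASIN'                 then @current[:product_id] = text
    when 'ItemAttributes/Title' then @current[:name] = text
    end
  end

  def end_element(name)
    return unless @current
    if name == 'Item'
      @items << @current  # a fully parsed Item is available here
      @current = nil
    else
      @path.pop           # leaving an element just pops its name
    end
  end
end

# Feed the handler the events a SAX parser would emit (made-up sample data).
h = ItemPathHandler.new
h.start_element('Item')
h.start_element('ASIN'); h.characters('B000TEST'); h.end_element('ASIN')
h.start_element('ItemAttributes')
h.start_element('Title'); h.characters('Example title'); h.end_element('Title')
h.end_element('ItemAttributes')
h.end_element('Item')
p h.items
```

Popping from a stack avoids the length arithmetic in endElement, and the joined path stays unambiguous when one element name happens to be a prefix of another.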

xml to objects

Accessing an XML tree as ordinary objects

For this you can use XSD::Mapping from the standard library (it shipped with Ruby 1.8 as part of soap4r):

require 'xsd/mapping'
people = XSD::Mapping.xml2obj(File.read("people.xml"))
people.person[2].name # => "name3"


If a tag name contains a hyphen, you can access it like this: people['foo-bar']

The reverse conversion, from an object tree back to XML, is handled by the method XSD::Mapping.obj2xml