HTML stripper

// Strips HTML tags from a string with a single gsub call.

str = <<HTML_TEXT
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
  <h1>Application error</h1>
  <p>Change this error message for exceptions thrown outside of an action (like 
in Dispatcher setups or broken Ruby code) in public/500.html</p>
</body>
</html>
HTML_TEXT

puts str.gsub(/<\/?[^>]*>/, "")

Comments on this post

jamiew posts on Jan 15, 2008 at 21:11
I also use this from time to time:

class String
  def strip_html(allowed = ['a','img','p','br','i','b','u','ul','li'])
  	str = self.strip || ''
  	str.gsub(/<(\/|\s)*[^(#{allowed.join('|') << '|\/'})][^>]*>/,'')
  end
end

quantum_dot posts on Mar 21, 2008 at 03:50
Why this?

  	str = self.strip || ''


I don't think String#strip ever returns nil or false, so the || '' does nothing. Also, why '|\/'? It lets </> through for no apparent reason. Plus the allowed list sits inside [^...], which is a character class rather than a group, so it only tests one character: any tag starting with one of those letters is kept (allowing b also keeps <body>, for example), and a capital A is always stripped because the match is case-sensitive. HTML isn't case-sensitive, mind you; XHTML is.
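To make the character-class point concrete, here is a small sketch that rebuilds the pattern from the earlier strip_html with its default allowed list and runs it on three inputs. The variable names (set, pattern, kept_body and so on) are only for illustration and are not part of either posted snippet.

```ruby
# Rebuild the pattern from the earlier comment with its default allowed list.
allowed = ['a', 'img', 'p', 'br', 'i', 'b', 'u', 'ul', 'li']
set     = allowed.join('|') << '|\/'
pattern = /<(\/|\s)*[^(#{set})][^>]*>/

# Everything between [^ and ] is a set of single characters, so the pattern
# only looks at the first character after "<" (and any leading slashes).
kept_body  = "<body>text</body>".gsub(pattern, '')     # "b" is in the set, so the tags stay
lost_h1    = "<h1>title</h1>".gsub(pattern, '')        # "h" is not in the set
lost_upper = "<A href='x'>link</A>".gsub(pattern, '')  # "A" is not in the set either

puts kept_body   # <body>text</body>
puts lost_h1     # title
puts lost_upper  # link
```

So the allowed list behaves as a list of allowed first letters, which is probably not what was intended.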

I knocked this up quickly. It was made by trial and error, so there's no guarantee it's perfect, but you can see how it works from the included tests. Save it into its own file and require it to use it, or run ruby [file name] to run the tests.

#!/usr/bin/env ruby

# String#strip_html - Removes HTML tags from a string.
# 
# Author:: Rob Pitt
# Copyright:: Copyright (c) 2008 Rob Pitt
# License:: Free to use and modify so long as credit to previous author(s) is left in place.
#

class String
  # Removes HTML tags from a string. Allows you to specify some tags to be kept.
  def strip_html( allowed = [] )    
    re = if allowed.any?
      Regexp.new(
        %(<(?!(\\s|\\/)*(#{
          allowed.map {|tag| Regexp.escape( tag )}.join( "|" )
        })( |>|\\/|'|"|<|\\s*\\z))[^>]*(>+|\\s*\\z)),
        Regexp::IGNORECASE | Regexp::MULTILINE, 'u'
      )
    else
      /<[^>]*(>+|\s*\z)/m
    end
    gsub(re,'')
  end
end

#--
# String#strip_html Self Test
if $0 == __FILE__
  # Go here for reference on rspec: http://rspec.info/examples.html
  require 'rubygems'
  require 'spec'

  describe String do
    before(:all) do
      @html = "foo < BAR> <><a href=\"www.beer.com\">xyzzy</a>>><br /><p><br/><p><br///><p></><br ///><p></br><p></ br> <cheeky"
    end

    it "should strip html tags" do
      @html.strip_html.should == "foo  xyzzy "
    end
    # cheeky keeper
    it "should allow given tags" do
      @html.strip_html( %w(bar br ) ).should == "foo < BAR> xyzzy<br /><br/><br///><br ///></br></ br> " 
    end

    it "should treat unclosed tags at the end of the document as tags to be safe" do
      @html.strip_html( %w(cheeky monkey) ).should == "foo  xyzzy <cheeky"
      "pass< cheeky".strip_html( %w(cheeky) ).should == "pass< cheeky"
      "pass< cheeky ".strip_html( %w(cheeky) ).should == "pass< cheeky "
      "pass< cheeky  ".strip_html( %w(cheeky) ).should == "pass< cheeky  "
    end
    
    it "should treat quotes, less-than signs and slashes as tag word terminators to be conservative" do
      q_test_html = "pass<img'foo' >"
      dbq_test_html = "pass<img\"foo\""
      lt_test_html = "pass<img<foo\" "
      q_test_html.strip_html( %w(img) ).should == q_test_html
      dbq_test_html.strip_html( %w(img) ).should == dbq_test_html
      lt_test_html.strip_html( %w(img) ).should == "pass<img" # treats foo as new tag
    end
    
    it "should know the difference between a tag called a and one called alpha" do
      difference_test = "passed<alpha> the test"
      difference_test.strip_html( %w(a) ).should == "passed the test"
      difference_test.strip_html( %w(alpha) ).should == difference_test
    end
    
  end
    
end
#++
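The listing above dates from 2008 and relies on conventions of that era (a three-argument Regexp.new, RSpec 1's require 'spec'). As a rough sketch for readers on a current Ruby, the same pattern can be written as a regex literal with the /im flags. This is an adaptation rather than the author's code: wrapping the argument in Array() and dropping the 'u' encoding argument are my assumptions, and the behaviour has only been checked against the simple inputs shown.

```ruby
class String
  # Removes HTML tags, optionally keeping the tag names listed in +allowed+.
  # Sketch adapted from the snippet above; Array() and the /im literal are
  # assumptions made for current Ruby versions.
  def strip_html(allowed = [])
    allowed = Array(allowed) # accept a single tag name as well as a list
    re = if allowed.any?
      names = allowed.map { |tag| Regexp.escape(tag) }.join("|")
      # The negative lookahead skips tags whose name is in the allowed list;
      # the name must be followed by a terminator so "a" does not match "alpha".
      /<(?!(\s|\/)*(#{names})( |>|\/|'|"|<|\s*\z))[^>]*(>+|\s*\z)/im
    else
      /<[^>]*(>+|\s*\z)/m
    end
    gsub(re, '')
  end
end

html = "<p>Hello <b>world</b></p>"
puts html.strip_html                              # Hello world
puts html.strip_html(%w(b))                       # Hello <b>world</b>
puts "passed<alpha> the test".strip_html('a')     # passed the test
puts "<BR>line".strip_html(%w(br))                # <BR>line
```

Passing a list such as %w(b) keeps both the opening and closing form of each allowed tag, and the match is case-insensitive, so <BR> is kept when br is allowed.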




