How to mass Search & Replace in text files.

William O'Higgins Witteman william.ohiggins-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Fri May 1 20:28:11 UTC 2009


On Fri, May 01, 2009 at 03:25:19PM -0400, Lance F. Squire wrote:
> Not finding a quick solution that worked in this case,

Sorry, work interferes :-)

> I have waded through the 154 files and cleaned them manually.

Drat, I'm late.  For posterity, here is my quick and dirty solution,
which does work on my test data from your pastebin:


#!/usr/bin/python

"""
Strip the bad trojan horse junk out of an HTML file.

"""

import re, os, fnmatch 


def stripick(text):
  """Return text with the injected <script> block after </head> removed."""
  firstbit = r"</head>\s*<script language=javascript><!--\s*"
  lastbit = r"\s*--></script>"
  # Raw string, so the backslashes reach the regex engine untouched
  badbit = r"\(function\(t.*?;"
  wholething = firstbit + badbit + lastbit
  pattern = re.compile(wholething)

  newstring = re.sub(pattern, r"</head>", text)

  return newstring


# Set the root of your recursive search here
top = "/home/willyyam/misc/python/cleanfiles"

for root, dirs, files in os.walk(top):
  for name in files:
    path = os.path.join(root, name)
    print(path)
    if fnmatch.fnmatch(path, "*.html"):
      with open(path, "r") as fileobj:
        filestring = fileobj.read() # Get the file contents in memory

      newfilestring = stripick(filestring) # Clean the string

      # Open the file for writing, clobbering it
      with open(path, "w") as fileobj:
        fileobj.write(newfilestring) # Write the new string into the file
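
If you want to check the pattern before letting it loose on a whole
tree, something like the following might help.  It is only a sketch:
the sample "infected" page is made up (the real junk in the pastebin
will look different), and strip_and_count is a hypothetical helper,
not part of the script above.  It uses re.subn, which returns the
number of replacements along with the new string, so you can see
whether a file would actually be touched.

```python
import re

# Same pattern pieces as in the script above.
FIRSTBIT = r"</head>\s*<script language=javascript><!--\s*"
BADBIT = r"\(function\(t.*?;"
LASTBIT = r"\s*--></script>"
PATTERN = re.compile(FIRSTBIT + BADBIT + LASTBIT)


def strip_and_count(text):
    """Return (cleaned_text, number_of_blocks_removed)."""
    return PATTERN.subn("</head>", text)


# Made-up sample; the real injected code will differ.
infected = (
    "<html><head><title>Test</title></head>\n"
    "<script language=javascript><!--\n"
    "(function(t){var x=1})(0);\n"
    "--></script>\n"
    "<body><p>Hello</p></body></html>\n"
)
clean = "<html><head><title>Test</title></head>\n<body><p>Hello</p></body></html>\n"

cleaned, hits = strip_and_count(infected)
print(hits)
print(cleaned)
```

One caveat: "." does not match a newline by default, so if the
injected code is spread over several lines the pattern would need to
be compiled with re.DOTALL to catch it.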


-- 

yours,

William
