I had an issue where I had to clean up one or more HTML pages that were built using Microsoft Word. Since I had to use these pages in many locations and apply a new style to them, I needed a good way to clean the HTML and did not want to do it by hand. I did a few searches and found that Microsoft actually makes a tool to do this: an Office add-on that cleans the junk HTML produced by Word, Excel, and FrontPage. You would think they would have done this in the actual application.
Well, that tool would not work for me, because I have Office 2003 and the tool was for Office 2000. So I did some more looking and found a nifty tool called tidy. There is a GUI version of tidy (TIDY), and there is also a command-line version that you can integrate into your own application.
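As a rough idea of what calling the command-line version from your own code could look like, here is a small Python sketch. It assumes the `tidy` executable is installed and on your PATH. The function names and file names are made up for illustration, and the option set is just one reasonable starting point for Word output, not necessarily what I ran at the time.

```python
import shutil
import subprocess


def build_tidy_command(src, dst):
    """Return the argument list for one tidy run (hypothetical helper)."""
    return [
        "tidy",
        "-quiet",                                # suppress nonessential output
        "--word-2000", "yes",                    # strip Word-specific markup
        "--bare", "yes",                         # remove more Word leftovers, e.g. non-breaking spaces
        "--clean", "yes",                        # replace presentational tags with style rules
        "--drop-proprietary-attributes", "yes",  # drop vendor-specific attributes
        "-output", str(dst),                     # write the cleaned page here
        str(src),                                # the Word-generated input page
    ]


def clean_word_html(src, dst):
    """Run tidy on src and write the result to dst (hypothetical helper)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(
        build_tidy_command(src, dst), capture_output=True, text=True
    )
    # tidy exits with 0 when all is well, 1 for warnings, 2 for errors
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode
```

To clean a whole folder you could loop over the files, for example with `pathlib.Path("pages").glob("*.html")`, and call `clean_word_html` on each one. Treat exit code 1 as normal here, since Word pages almost always trigger warnings.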
Using this saved me a few hours of work. I love open source.