
New PHP script: any security issues?

Posted: 2008-08-06 04:09pm
by Darth Wong
I just created a PHP script, mostly for my own use (although anybody can use it if they like), which converts HTML-formatted pages to BBcode. I made it because I hate the way copying and pasting from the web browser often butchers the shit out of a webpage, especially with all kinds of unnecessary line breaks. I used to have a tutorial for doing this in the Announcements area, but that involved creating your own *nix script, so it was a pretty big PITA for most users, and a web-based conversion utility is much simpler.

The question is: is there anything I should be concerned about, security-wise? It's a pretty simple script so I can't see how something could go wrong, but I'm no security expert:

http://bbs.stardestroyer.net/html2bbcode.php

Code:

<?
// HTML to BBCode Converter

// Set search/replace variables
unset($pattern);
unset($replacement);

// Eliminate whitespace
$pattern[]="/[ \t]+/";
$replacement[]=" ";

// Images (note that .*? is an ungreedy version of .*)
$pattern[]="/<IMG.*?SRC.*?\"(.*?)\".*?>/i";
$replacement[]="[img]\\1[/img]";

// Links
$pattern[]="/<A.[^>]*HREF[^\"]*\"([^\"]*)\".*?>(.*?)<\/A>/i";
$replacement[]="[url=\\1]\\2[/url]";

// Forms
$pattern[]="/<FORM.*?<\/FORM>/i";
$replacement[]="";

// Floats
$pattern[]="/<DIV[^>]*FLOAT.*?>.*?<\/DIV>/i";
$replacement[]="";

// Paragraph structure
$pattern[]="/<P.*?>|<\/P>|<DIV.*?>|<\/DIV>|<BR.*?>|<\/TD>/i";
$replacement[]="\n";
$pattern[]="/<BLOCKQUOTE>/i";
$replacement[]="[quote]";
$pattern[]="/<BLOCKQUOTE[^>]*CITE=\"(.*?)\".*?>/i";
$replacement[]="[quote=\"\\1\"]";
$pattern[]="/<\/BLOCKQUOTE>/i";
$replacement[]="[/quote]";

// Miscellaneous HTML codes
$pattern[]="/<I>|<I .*?>/i";
$replacement[]="[i]";
$pattern[]="/<\/I>/i";
$replacement[]="[/i]";
$pattern[]="/<B>|<B .*?>/i";
$replacement[]="[b]";
$pattern[]="/<\/B>/i";
$replacement[]="[/b]";
$pattern[]="/<U>|<U .*?>/i";
$replacement[]="[u]";
$pattern[]="/<\/U>/i";
$replacement[]="[/u]";
$pattern[]="/<H1.*?>/i";
$replacement[]="\n[b][size=24]";
$pattern[]="/<\/H1>/i";
$replacement[]="[/size][/b]\n";
$pattern[]="/<H2.*?>/i";
$replacement[]="\n[b][size=20]";
$pattern[]="/<\/H2>/i";
$replacement[]="[/size][/b]\n";
$pattern[]="/<H3.*?>/i";
$replacement[]="\n[b][size=16]";
$pattern[]="/<\/H3>/i";
$replacement[]="[/size][/b]\n";
$pattern[]="/<OL.*?>/i";
$replacement[]="[list=1]";
$pattern[]="/<UL.*?>/i";
$replacement[]="[list]";
$pattern[]="/<\/OL>|<\/UL>/i";
$replacement[]="[/list]";
$pattern[]="/<LI.*?>/i";
$replacement[]="[*]";
$pattern[]="/<PRE>/i";
$replacement[]="[code]";
$pattern[]="/<\/PRE>/i";
$replacement[]="[/code]";

// Special characters not processed by html_entity_decode
$pattern[]="/&mdash;|&ndash;|–|—/";
$replacement[]="-";
$pattern[]="/&ldquo;|&rdquo;|\"|“|”/";
$replacement[]="\"";
$pattern[]="/&rsquo;|&lsquo;|'|‘|’/";
$replacement[]="'";

// Acquire data
if (isset($_POST['htmlsource']))
{
// Read in data but remove line feeds and carriage returns
$htmlsource=ereg_replace("/\n|\r|\r\n|\n\r/"," ",$_POST['htmlsource']);

// Perform HTML substitution to BBcode
$htmlsource=preg_replace($pattern,$replacement,html_entity_decode($htmlsource));

// Eliminate all remaining HTML tags
$htmlsource=preg_replace("/<.*?>/","",$htmlsource);

// Replace remaining newlines with <br /> tags for output
$htmlsource=preg_replace("/\n/","<br />",htmlspecialchars($htmlsource));

// Output results
echo "<html>\n<body>\n".$htmlsource."\n</body>\n</html>\n";
}
else
{
?>
<html>
<body>
<h1 style="text-align:center">HTML Source to BB Code Converter</h1>
<p>Copy and paste the HTML source code into this text box:</p>
<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
<textarea rows="20" cols="80" name="htmlsource"></textarea><br />
<input type="submit" value="Submit" /><input type="reset" />
</form>
</body>
</html>
<?
}
?>

Posted: 2008-08-06 05:43pm
by Ariphaos
Someone may vet your regexps, but this is the only thing that caught my eye:
$htmlsource=ereg_replace("/\n|\r|\r\n|\n\r/"," ",$_POST['htmlsource']);
Probably want that to be preg_replace, I would assume.
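
For reference, a minimal sketch of that substitution done with preg_replace(), assuming the goal is simply to turn every line ending into a single space. The sample string is made up for illustration; the script's real input is $_POST['htmlsource']. Note also that ereg_replace() was deprecated in PHP 5.3 and removed in PHP 7, so this form is the one that still runs on current PHP.

```php
<?php
// Hypothetical sample input standing in for $_POST['htmlsource'].
$input = "line one\r\nline two\rline three\nline four";

// Single-quoted pattern, so PCRE itself interprets \r and \n.
// \r\n comes first so a Windows line ending becomes one space, not two.
$flattened = preg_replace('/\r\n|\r|\n/', ' ', $input);

echo $flattened, "\n"; // line one line two line three line four
```

Unlike the POSIX ereg functions, the preg functions expect the pattern to be wrapped in delimiters, which is why the slashes belong here.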

Otherwise, you're just doing a lot of regexps on a submission and outputting bbcode, you're not trusting the information at all except to process it.
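
A small sketch of why the output step is hard to abuse, assuming a tag somehow survived the earlier stripping pass. The $converted string is a hypothetical example, not output captured from the real script.

```php
<?php
// Hypothetical text left over after conversion, with a tag that slipped through.
$converted = "[b]bold[/b]\n<script>alert(1)</script>";

// Escape first, then turn newlines into <br /> tags, in the same order the script uses.
$safe = preg_replace("/\n/", "<br />", htmlspecialchars($converted));

echo $safe, "\n"; // [b]bold[/b]<br />&lt;script&gt;alert(1)&lt;/script&gt;
```

Because htmlspecialchars() runs before the <br /> tags are added, the only live markup in the page is the markup the script itself writes.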

Posted: 2008-08-06 06:33pm
by Starglider
Damn. That reminds me, I still have all the proposed code upgrades for this board waiting to be packaged into a single mod against the new code base. All available time for that got sucked up by contributing to the Armageddon fic (I'm a really slow writer).

Maybe I'll have time to take a look at that over Christmas. Maybe. :(

Anyway, this code looks fine to me.

Posted: 2008-08-07 04:21am
by Dooey Jo
There are also the html_entity_decode() and htmlentities() functions for transforming html entities into their applicable characters, and reverse. As well, there is a trim() function that removes whitespace, and a nl2br() that transforms newlines into <br/> nodes.
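
A quick sketch of what those four functions do, using made-up sample strings. The explicit 'UTF-8' charset argument is an assumption about the page encoding, not something taken from the script.

```php
<?php
// Entities to characters, and back again.
$decoded = html_entity_decode('Fish &amp; chips &lt;b&gt;', ENT_QUOTES, 'UTF-8');
$encoded = htmlentities($decoded, ENT_QUOTES, 'UTF-8');

// trim() strips whitespace from the two ends of a string only.
$trimmed = trim("  \t padded text \n");

// nl2br() inserts <br /> before each newline and keeps the newline itself.
$withBreaks = nl2br("first\nsecond");

echo $decoded, "\n";    // Fish & chips <b>
echo $encoded, "\n";    // Fish &amp; chips &lt;b&gt;
echo $trimmed, "\n";    // padded text
echo $withBreaks, "\n"; // first<br /> followed by a newline, then second
```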

Posted: 2008-08-07 03:48pm
by Darth Wong
Dooey Jo wrote:There are also the html_entity_decode() and htmlentities() functions for transforming html entities into their applicable characters, and reverse. As well, there is a trim() function that removes whitespace, and a nl2br() that transforms newlines into <br/> nodes.
Good idea, using html_entity_decode, although it doesn't get all of the special character codes for some reason. Nevertheless, it cuts down significantly on the size of the script. The trim() function won't work for me because when you enter the text into a textarea input field on an HTML form, it comes through as a single large text variable, not as a series of lines upon which you can use trim(). And nl2br() does the opposite of what I want to do, which is to convert <br /> codes to newlines.
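
One possible explanation for the entities that were left undecoded, offered as an assumption rather than a diagnosis of this particular server: PHP 5 of that era defaulted html_entity_decode() to ISO-8859-1, which has no characters for the em dash or curly quotes, so those entities are left alone. A minimal sketch of naming a UTF-8 charset explicitly, with a made-up sample string:

```php
<?php
// Hypothetical sample containing entities outside the Latin-1 range.
$source = 'Stars &mdash; &ldquo;quoted&rdquo; it&rsquo;s';

// Assumption: the page is handled as UTF-8, so the charset is passed explicitly.
$decoded = html_entity_decode($source, ENT_QUOTES, 'UTF-8');

// The result holds the real em dash and curly quote characters as UTF-8 bytes.
echo $decoded, "\n";
```

If the decoded characters should still end up as plain ASCII dashes and quotes, the script's own replacement patterns for the literal characters would handle that afterwards.
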
Destructionator XIII wrote:Including translation from a html <code>, or better yet, <pre> block into a bbcode one might be useful too.
Ah yes, that's useful. I edited that into the code.
Starglider wrote:Damn. That reminds me, I still have all the proposed code upgrades for this board waiting to be packaged into a single mod against the new code base. All available time for that got sucked up by contributing to the Armageddon fic (I'm a really slow writer).

Maybe I'll have time to take a look at that over Christmas. Maybe. :(

Anyway, this code looks fine to me.
Good to hear, but you may want to hold off on that. I've been informed that the phpBB team is withdrawing support for the entire phpBB 2.0.x codebase as of this coming October, which means I have to upgrade to version 3.0.x of the software before then. There will be some pretty drastic changes in terms of feature sets when I do this (I'm currently working on modifying the default template to look more like BlackSoul), so we'll have to see what remains to be done afterwards.
Xeriar wrote:Probably want that to be preg_replace, I would assume.
You would think that, but for some reason the NL/CR replacements worked with ereg_replace but not with preg_replace. I wasn't about to beat my head against my desk trying to figure out why, so I just left it that way.

Posted: 2008-08-10 06:24am
by Dooey Jo
Darth Wong wrote:
Dooey Jo wrote:There are also the html_entity_decode() and htmlentities() functions for transforming html entities into their applicable characters, and reverse. As well, there is a trim() function that removes whitespace, and a nl2br() that transforms newlines into <br/> nodes.
Good idea, using html_entity_decode, although it doesn't get all of the special character codes for some reason. Nevertheless, it cuts down significantly on the size of the script. The trim() function won't work for me because when you enter the text into a textarea input field on an HTML form, it comes through as a single large text variable, not as a series of lines upon which you can use trim(). And nl2br() does the opposite of what I want to do, which is to convert <br /> codes to newlines.
Ah, I forgot that trim() only cuts the whitespace at the beginning and end of a string, so no, it won't work very well here. I figured you could use nl2br() for this line, though:

Code:

   // Replace remaining newlines with <br /> tags for output
   $htmlsource=preg_replace("/\n/","<br />",htmlspecialchars($htmlsource));
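
A sketch of the two forms side by side, using a made-up $htmlsource value. The one visible difference is that nl2br() keeps the original newline after each <br /> it inserts, which a browser treats as ordinary whitespace.

```php
<?php
// Hypothetical converted text standing in for the script's $htmlsource.
$htmlsource = "first line\nsecond line <b>";

// The form used in the script: each newline is replaced by a <br /> tag.
$viaRegex = preg_replace("/\n/", "<br />", htmlspecialchars($htmlsource));

// The nl2br() form: a <br /> tag is inserted before each newline.
$viaNl2br = nl2br(htmlspecialchars($htmlsource));

echo $viaRegex, "\n"; // first line<br />second line &lt;b&gt;
echo $viaNl2br, "\n"; // same text, with a newline kept after the <br />
```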