
PHP script debugging help

Posted: 2008-09-03 01:59am
by Darth Wong
Does anyone remember the script from this old thread?

I discovered recently that it only works up to a certain size of text. If I copy and paste a block of text that's too large, it just chokes and treats it as if I didn't submit anything at all. Does anyone know why it would do this?

Posted: 2008-09-03 03:02am
by phongn
How big is post_max_size set to?

Posted: 2008-09-03 08:53am
by Darth Wong
phongn wrote:How big is post_max_size set to?
8M. Much bigger than the size of text I'm posting.
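
For reference, one way to rule post_max_size in or out: when a POST body exceeds that setting, PHP leaves $_POST empty, which looks exactly like submitting nothing. The snippet below is only a sketch; the helper name ini_size_to_bytes is made up for illustration and is not part of the script in this thread.

```php
<?php
// Hypothetical helper (an assumption, not from the thread):
// converts php.ini shorthand such as "8M" or "64K" to a byte count.
function ini_size_to_bytes($value)
{
    $value  = trim($value);
    $number = (int) $value;                  // leading digits, e.g. 8 from "8M"
    $unit   = strtoupper(substr($value, -1));
    switch ($unit) {
        case 'G': $number *= 1024;           // fall through
        case 'M': $number *= 1024;           // fall through
        case 'K': $number *= 1024;
    }
    return $number;
}

// Compare the declared request size with the configured limit.
$limit = ini_size_to_bytes(ini_get('post_max_size'));
$sent  = isset($_SERVER['CONTENT_LENGTH']) ? (int) $_SERVER['CONTENT_LENGTH'] : 0;
if ($limit > 0 && $sent > $limit) {
    echo "Request body of $sent bytes exceeds post_max_size ($limit bytes).\n";
}
```

If the warning never fires for the failing submissions, the data is being lost before it reaches PHP.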

Posted: 2008-09-03 11:44am
by Darth Wong
Shit. I just discovered that web browsers limit the amount of text you can copy and paste into an HTML TEXTAREA, to 32kB or 64kB: much too restrictive for processing webpages. So the problem is in the browser, not the script itself.

Maybe I need to give it an alternate way of submitting data.

Posted: 2008-09-03 12:22pm
by Dooey Jo
That seems strange. I'm sure things like Wikipedia send way more than 64kB using textboxes. Perhaps you can try using some other enctype for the form; maybe enctype="multipart/form-data", which is usually used for uploading files.

Posted: 2008-09-03 11:31pm
by Darth Wong
The enctype didn't help, so I just added an alternate way of submitting data: entering the URL so the script can just download the source itself.

Posted: 2008-09-04 12:09am
by phongn
Darth Wong wrote:Shit. I just discovered that web browsers limit the amount of text you can copy and paste into an HTML TEXTAREA, to 32kB or 64kB: much too restrictive for processing webpages. So the problem is in the browser, not the script itself.
Apparently it's part of the HTML specification.
Darth Wong wrote:Maybe I need to give it an alternate way of submitting data.
You could have files uploaded instead and then process them.
Dooey Jo wrote:That seems strange. I'm sure things like Wikipedia send way more than 64kB using textboxes. Perhaps you can try using some other enctype for the form; maybe enctype="multipart/form-data", which is usually used for uploading files.
64KB of text?! I don't think so.
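
The file-upload route suggested in this post could look roughly like the sketch below. The field name htmlfile, the function name read_uploaded_html, and the 1 MB cap are all assumptions made for illustration; none of them come from the thread.

```php
<?php
// Sketch only: returns the contents of one $_FILES entry, or null on any problem.
// "read_uploaded_html" and the 1 MB default cap are hypothetical choices.
function read_uploaded_html($file, $maxBytes = 1048576)
{
    if (!is_array($file) || !isset($file['error']) || $file['error'] !== UPLOAD_ERR_OK) {
        return null; // nothing uploaded, or PHP reported an upload error
    }
    if (!is_uploaded_file($file['tmp_name']) || $file['size'] > $maxBytes) {
        return null; // not a genuine upload, or larger than the cap
    }
    $contents = file_get_contents($file['tmp_name']);
    return $contents === false ? null : $contents;
}

// Matching form (assumed markup):
//   <form method="post" enctype="multipart/form-data">
//     <input type="file" name="htmlfile" /> <input type="submit" />
//   </form>
if (isset($_FILES['htmlfile'])) {
    $htmlsource = read_uploaded_html($_FILES['htmlfile']);
}
```

Uploads are still subject to upload_max_filesize and post_max_size on the server side.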

Posted: 2008-09-04 12:25am
by Pu-239
Wikipedia does let you edit only subsections of an article which helps get around this. Anyway, assuming 1 byte/character and 8 characters/word and not accounting for spaces or punctuation, an article would have to be 8192 words long to hit this limit. Wiki does have some markup which would increase this. Anyway, the current featured article (http://en.wikipedia.org/wiki/Emmy_Noether) is 95KB long.
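
The estimate in this post works out as follows, using its own assumptions (1 byte per character, 8 characters per word) and taking 64 kB as 65,536 bytes:

```php
<?php
// Assumptions taken from the post: 1 byte per character, 8 characters per word.
$limitBytes   = 64 * 1024;                        // 65536 bytes
$bytesPerWord = 1 * 8;                            // 8 bytes per word
$words = (int) ($limitBytes / $bytesPerWord);     // words needed to reach the limit
echo $words, "\n";                                // 8192
```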

Posted: 2008-09-04 05:32am
by Dooey Jo
phongn wrote:
Dooey Jo wrote:That seems strange. I'm sure things like Wikipedia send way more than 64kB using textboxes. Perhaps you can try using some other enctype for the form; maybe enctype="multipart/form-data", which is usually used for uploading files.
64KB of text?! I don't think so.
That's not really that much, especially if there's markup in there, not to mention unicode characters. I just tested posting 74kB of pure text into Google Translate and it worked fine even in IE6, so it must be possible. I also can't find any reference in the specifications to a maximum size for form fields. I know that IE used to limit GET data to 2000 characters or something like that, but that's about it.
Darth Wong wrote:I just added an alternate way of submitting data: entering the URL so the script can just download the source itself.
Yeah, that's probably a more user-friendly solution anyway :)

Posted: 2008-09-04 09:44am
by Darth Wong
New and improved version of the script, with a few other minor improvements, like getting rid of scripts and headers, fixing HRs, and warning you if you've gone over the phpBB post size limit which is 64 kB:

Code:

<?php
// HTML to BBCode Converter

// Set search/replace variables
unset($pattern);
unset($replacement);

// Eliminate excess whitespace
$pattern[]="/[ \t]+/";
$replacement[]=" ";

// Images (note that .*? is an ungreedy version of .*)
$pattern[]="/<IMG.*?SRC.*?\"(.*?)\".*?>/i";
$replacement[]="[img]\\1[/img]";

// Convert http links to URL BBcode, ignore other link types
$pattern[]="/<A.[^>]*HREF=\"*(HTTP:[^\"]*)\".*?>(.*?)<\/A>/i";
$replacement[]="[url=\\1]\\2[/url]";

// Eliminate forms, floats, headers, and scripts
$pattern[]="/<FORM.*?<\/FORM>/i";
$replacement[]="";
$pattern[]="/<DIV[^>]*FLOAT.*?>.*?<\/DIV>/i";
$replacement[]="";
$pattern[]="/<HEAD.*?<\/HEAD.*?>/i";
$replacement[]="";
$pattern[]="/<SCRIPT.*?<\/SCRIPT.*?>/i";
$replacement[]="";

// Paragraph structure
$pattern[]="/<P(\s[^>]*)?>|<\/P>|<DIV.*?>|<\/DIV>|<BR.*?>|<\/TD>/i";
$replacement[]="\n";
$pattern[]="/<BLOCKQUOTE>/i";
$replacement[]="[quote]";
$pattern[]="/<BLOCKQUOTE[^>]*CITE=\"(.*?)\".*?>/i";
$replacement[]="[quote=\"\\1\"]";
$pattern[]="/<\/BLOCKQUOTE>/i";
$replacement[]="[/quote]";

// Miscellaneous HTML codes
$pattern[]="/<I>|<I .*?>/i";
$replacement[]="[i]";
$pattern[]="/<\/I>/i";
$replacement[]="[/i]";
$pattern[]="/<B>|<B .*?>/i";
$replacement[]="[b]";
$pattern[]="/<\/B>/i";
$replacement[]="[/b]";
$pattern[]="/<U>|<U .*?>/i";
$replacement[]="[u]";
$pattern[]="/<\/U>/i";
$replacement[]="[/u]";
$pattern[]="/<H1.*?>/i";
$replacement[]="\n[b][size=24]";
$pattern[]="/<\/H1>/i";
$replacement[]="[/size][/b]\n";
$pattern[]="/<H2.*?>/i";
$replacement[]="\n[b][size=20]";
$pattern[]="/<\/H2>/i";
$replacement[]="[/size][/b]\n";
$pattern[]="/<H3.*?>/i";
$replacement[]="\n[b][size=16]";
$pattern[]="/<\/H3>/i";
$replacement[]="[/size][/b]\n";
$pattern[]="/<OL.*?>/i";
$replacement[]="[list=1]";
$pattern[]="/<UL.*?>/i";
$replacement[]="[list]";
$pattern[]="/<\/OL>|<\/UL>/i";
$replacement[]="[/list]";
$pattern[]="/<LI.*?>/i";
$replacement[]="[*]";
$pattern[]="/<PRE>/i";
$replacement[]="[code]";
$pattern[]="/<\/PRE>/i";
$replacement[]="[/code]";

// Replace horizontal rulers with underscores
$pattern[]="/<HR.*?>/i";
$replacement[]="\n________________________________________\n\t\n";

// Special characters not processed by html_entity_decode
$pattern[]="/&mdash;|&ndash;|–|—/";
$replacement[]="-";
$pattern[]="/&ldquo;|&rdquo;|&quot;|“|”/";
$replacement[]="\"";
$pattern[]="/&rsquo;|&lsquo;|'|‘|’/";
$replacement[]="'";

// Acquire data
unset($htmlsource);
if (isset($_POST['htmlsource']))
{
    // Read in data but replace line feeds and carriage returns with spaces
    $htmlsource=preg_replace("/\r\n|\r|\n/"," ",$_POST['htmlsource']);
}
else if (isset($_POST['url']))
{
    $handle=@fopen($_POST['url'],"rb");
    if ($handle)
    {
        stream_set_timeout($handle,10);
        $line="";
        $urlsource="";
        // Read in source
        while (!feof($handle) && strlen($line)<1024*256)
        {
            $line=rtrim(fgets($handle,1024*256));
            $urlsource.=$line." ";
        }
        fclose($handle);
        // Replace any remaining line feeds and carriage returns with spaces
        $htmlsource=preg_replace("/\r\n|\r|\n/"," ",$urlsource);
    }
}

if (isset($htmlsource))
{
    // Perform HTML substitution to BBcode
    $htmlsource=preg_replace($pattern,$replacement,html_entity_decode($htmlsource));

    // Eliminate all remaining HTML tags
    $htmlsource=preg_replace("/<.*?>/","",$htmlsource);

    // Collapse large groups of newlines (3 or more) into a single blank line
    $htmlsource=preg_replace("/\n[ ]*[\n][ ]*[\n]+/","\n\n",$htmlsource);

    // Check length
    if (strlen($htmlsource)>65535)
    {
        echo "<h2>Warning: Post length is ".strlen($htmlsource)." characters. ";
        echo "The phpBB post size limit is 65535 characters.</h2>\n\n";
    }

    // Replace remaining newlines with <br /> tags for output
    $htmlsource=preg_replace("/\n/","<br />",htmlspecialchars($htmlsource));

    // Output results
    echo "<html>\n<body>\n".$htmlsource."\n</body>\n</html>\n";
}
else
{
?>
<html>
<body>
<h1 style="text-align:center">HTML Source to BB Code Converter</h1>
<?php
if (isset($_POST['textsubmit'])) echo "<p><b>Error</b>: Exceeds script input size limit.</p>\n";
if (isset($_POST['urlsubmit'])) echo "<p><b>Error</b>: URL could not be retrieved.</p>\n";
?>
<p style="margin-bottom:0">Copy and paste the HTML source code into this text box:</p>
<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
<textarea rows="20" cols="80" name="htmlsource"></textarea><br />
<input type="submit" name="textsubmit" value="Submit" /><input type="reset" />
</form>
<p style="margin-bottom:0">Alternatively, enter a URL for the website to convert:</p>
<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
<input type="text" size="80" name="url" value="http://" />
<input type="submit" name="urlsubmit" value="Submit" /><input type="reset" />
</form>

</body>
</html>
<?php
}
?>

Posted: 2008-09-04 10:03am
by Darth Wong
I tried the new version on a page full of extraneous markup and scripts (the CNN homepage) and it seems to work fine. It also works on my unfinished hidden Reign of Terror fanfic single-page version even though it's well above the phpBB post size limit (that's because it includes all of the chapters ever written on a single page, including three chapters that I never posted publicly and one which was only partially completed).
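
For a quicker sanity check than a full page, a few of the script's patterns can be run against a small fragment. This is only an illustration: the sample HTML string is made up, and just three of the script's pattern/replacement pairs are copied in.

```php
<?php
// Three pattern/replacement pairs copied from the script above.
$pattern = array();
$replacement = array();
$pattern[] = "/<B>|<B .*?>/i";
$replacement[] = "[b]";
$pattern[] = "/<\/B>/i";
$replacement[] = "[/b]";
$pattern[] = "/<A.[^>]*HREF=\"*(HTTP:[^\"]*)\".*?>(.*?)<\/A>/i";
$replacement[] = "[url=\\1]\\2[/url]";

// Made-up sample input (an assumption, not taken from any real page).
$html = '<p><b>Breaking:</b> see <a href="http://www.cnn.com/WORLD/">World</a></p>';

// preg_replace applies each pattern in array order, then leftover tags are stripped.
$bbcode = preg_replace($pattern, $replacement, $html);
$bbcode = preg_replace("/<.*?>/", "", $bbcode);

echo $bbcode, "\n";
// [b]Breaking:[/b] see [url=http://www.cnn.com/WORLD/]World[/url]
```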

Posted: 2008-09-04 04:12pm
by Darth Wong
Does anyone know how to extract the website domain name and path from a URL? In other words:

http://www.cnn.com/ => www.cnn.com
http://www.cnn.com => www.cnn.com
http://www.cnn.com/WORLD/ => www.cnn.com/WORLD
http://www.cnn.com/WORLD/StupidArticle.html => www.cnn.com/WORLD

There doesn't seem to be a built-in function, and maybe I just suck at regexp but I'm having trouble getting it to do what I want.
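
PHP does ship a built-in URL splitter, parse_url(), which returns the host and path as separate array entries. Below is a rough sketch of the mapping listed above. The function name domain_and_path is made up for illustration, and it assumes that a final path segment without a trailing slash is a file name to be dropped (so http://www.cnn.com/WORLD with no trailing slash would come back as www.cnn.com).

```php
<?php
// Hypothetical helper (not from the thread): host plus directory part of the path.
function domain_and_path($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // malformed URL or no host component
    }
    $path = isset($parts['path']) ? $parts['path'] : '';
    // Assumption: a path that does not end in "/" ends in a file name, so drop it.
    if ($path !== '' && substr($path, -1) !== '/') {
        $path = substr($path, 0, strrpos($path, '/') + 1);
    }
    // Remove the trailing slash so "/WORLD/" becomes "/WORLD" and "/" becomes "".
    return $parts['host'] . rtrim($path, '/');
}

echo domain_and_path('http://www.cnn.com/'), "\n";                         // www.cnn.com
echo domain_and_path('http://www.cnn.com'), "\n";                          // www.cnn.com
echo domain_and_path('http://www.cnn.com/WORLD/'), "\n";                   // www.cnn.com/WORLD
echo domain_and_path('http://www.cnn.com/WORLD/StupidArticle.html'), "\n"; // www.cnn.com/WORLD
```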