Strip HTML from Text JavaScript - Stack Overflow

Is there an easy way to take a string of HTML in JavaScript and strip out the HTML?


Check out the ticked answer to this: stackoverflow.com/questions/795512/… – karim79 May 4 '09 at 23:15
this is javascript though. – nickf May 4 '09 at 23:23

10 Answers

105 votes, accepted answer

If you're running in a browser, then the easiest way is just to let the browser do it for you...

function strip(html)
{
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText;
}
+1 good answer! – nickf May 4 '09 at 22:50
Just remember that this approach is rather inconsistent and will fail to strip certain characters in certain browsers. For example, in Prototype.js, we use this approach for performance, but work around some of the deficiencies - github.com/kangax/prototype/blob/… – kangax Sep 14 '09 at 16:08
Remember your whitespace will be messed about. I used to use this method, and then had problems as certain product codes contained double spaces, which ended up as single spaces after I got the innerText back from the DIV. Then the product codes did not match up later in the application. – Magnus Smith Sep 17 '09 at 15:03
@Magnus Smith: Yes, if whitespace is a concern - or really, if you have any need for this text that doesn't directly involve the specific HTML DOM you're working with - then you're better off using one of the other solutions given here. The primary advantages of this method are that it is 1) trivial, and 2) will reliably process tags, whitespace, entities, comments, etc. in the same way as the browser you're running in. That's frequently useful for web client code, but not necessarily appropriate for interacting with other systems where the rules are different. – Shog9 Sep 17 '09 at 21:05
Don't use this with HTML from an untrusted source. To see why, try running strip("<img src=bogus>") – Mike Samuel Sep 22 '11 at 18:06
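For untrusted input, one variation that is sometimes suggested is to parse the markup into a separate document rather than assigning it to a live element. The sketch below assumes a browser that supports DOMParser with the 'text/html' type; stripViaParser is a hypothetical name, not something posted in this thread, and it has not been checked against every browser.

```javascript
// Hypothetical helper: extract text without attaching the markup to the live page.
function stripViaParser(html) {
    if (typeof DOMParser === 'undefined') {
        // Outside a browser (for example in Node) there is no DOMParser.
        throw new Error('DOMParser is not available in this environment');
    }
    // parseFromString builds a separate document; scripts in it are not executed.
    var doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent || '';
}
```

In a supporting browser, stripViaParser('<b>bold</b> text') should return "bold text". Whitespace and entity handling still follow that browser's parser, so the caveats in the comments above continue to apply.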
myString.replace(/<(?:.|\n)*?>/gm, '');
Great that it works on non browser js (like node) as well. – Daniel Ribeiro Dec 8 '10 at 19:30
Doesn't work for <img src=http://www.google.com.kh/images/srpr/nav_logo27.png if you're injecting via document.write or concatenating with a string that contains a > before injecting via innerHTML. – Mike Samuel Dec 24 '10 at 15:07
@Mike, you should do the replacement after the string has actually been finished – nickf Dec 26 '10 at 15:48
an easy fix is to change /<.*?>/g to /<[^>]*>?/g. If you agree, please edit your post so that broken security advice doesn't get copy/pasted by naïve users like Mr. Ribeiro. – Mike Samuel Dec 27 '10 at 3:16
@PerishableDave, I agree that the > will be left in the second. That's not an injection hazard though. The hazard occurs due to < left in the first, which causes the HTML parser to be in a context other than data state when the second starts. Note there is no transition from data state on >. – Mike Samuel Sep 22 '11 at 18:04
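The difference discussed in these comments can be seen with a small test. This is only an illustrative sketch: the two function names and the sample input are made up here, while the regular expressions are the ones quoted above.

```javascript
// The regex from the answer: a tag only matches if its closing ">" is present.
function stripClosedTagsOnly(s) {
    return s.replace(/<(?:.|\n)*?>/gm, '');
}

// The variant suggested in the comment: the closing ">" is optional,
// so a dangling "<..." at the end of the input is removed too.
function stripIncludingUnclosed(s) {
    return s.replace(/<[^>]*>?/g, '');
}

var input = 'Hello <b>world</b> <img src=bogus';
console.log(JSON.stringify(stripClosedTagsOnly(input)));    // "Hello world <img src=bogus"
console.log(JSON.stringify(stripIncludingUnclosed(input))); // "Hello world "
```

With the first pattern the unterminated <img survives, which is the case that becomes dangerous once the result is concatenated with other markup. Neither pattern should be treated as a full sanitizer.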

Another, admittedly less elegant solution than nickf's or Shog9's, would be to recursively walk the DOM starting at the <body> tag and append each text node.

var bodyContent = document.getElementsByTagName('body')[0];
var result = appendTextNodes(bodyContent);

function appendTextNodes(element) {
    var text = '';

    // Loop through the childNodes of the passed in element
    for (var i = 0, len = element.childNodes.length; i < len; i++) {
        // Get a reference to the current child
        var node = element.childNodes[i];

        // Append the node's value if it's a text node
        if (node.nodeType == 3) {
            text += node.nodeValue;
        }

        // Recurse through the node's children, if there are any,
        // and keep the text that the recursive call returns
        if (node.childNodes.length > 0) {
            text += appendTextNodes(node);
        }
    }

    // Return the final result
    return text;
}
yikes. if you're going to create a DOM tree out of your string, then just use shog's way! – nickf May 4 '09 at 23:21
Yes, my solution wields a sledge-hammer where a regular hammer is more appropriate :-). And I agree that yours and Shog9's solutions are better, and basically said as much in the answer. I also failed to reflect in my response that the html is already contained in a string, rendering my answer essentially useless as regards the original question anyway. :-( – Bryan May 5 '09 at 0:08
To be fair, this has value - if you absolutely must preserve /all/ of the text, then this has at least a decent shot at capturing newlines, tabs, carriage returns, etc... Then again, nickf's solution should do the same, and do much faster... eh. – Shog9 May 5 '09 at 4:58

Converting HTML for Plain Text emailing keeping hyperlinks (a href) intact

The above function posted by hypoxide works fine, but I was after something that would convert HTML created in a web rich-text editor (for example FCKEditor), clearing out all the HTML but leaving the links. I wanted both the HTML and the plain-text version so that I could build the correct parts of an SMTP email (both HTML and plain text).

After a long time searching Google, my colleagues and I came up with this using the regex engine in JavaScript:-

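str = 'this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
// Line breaks and paragraph starts become newlines
str = str.replace(/<br>/gi, "\n");
// [^>]* stops at the end of the opening tag; a greedy .* would also swallow the paragraph text
str = str.replace(/<p\b[^>]*>/gi, "\n");
// Keep each link as "text (Link->URL)"
str = str.replace(/<a\b[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
// Remove every remaining tag
str = str.replace(/<(?:.|\s)*?>/g, "");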

the str var starts out like this:-

this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>

which renders like this:-

--start--

this string has html code i want to remove
Link Number 1 ->BBC Link Number 1

Now back to normal text and stuff

--end--

and then after the code has run it looks like this:-

this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk)  Link Number 1


Now back to normal text and stuff

As you can see, all the HTML has been removed and the links have been preserved, with the hyperlinked text still intact. I have also replaced the <p> and <br> tags with \n (newline char) so that some sort of visual formatting has been retained.

To change the link format (e.g. "BBC (Link->http://www.bbc.co.uk)"), just edit the " $2 (Link->$1) " replacement string, where $1 is the href URL/URI and $2 is the hyperlinked text. With the links written directly in the body of the plain text, most mail clients convert them so the user can click on them.
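To make the replacement-string idea concrete, here is a minimal sketch that wraps the same steps in a function and takes the link format as a parameter. htmlToTextKeepLinks, its linkFormat parameter and the sample string are hypothetical names chosen for illustration, and the patterns are slightly tightened variants of the ones in this answer.

```javascript
// Hypothetical helper: converts simple editor HTML to plain text and
// formats each link with a caller-supplied replacement pattern.
function htmlToTextKeepLinks(html, linkFormat) {
    // Default mirrors the format used in the answer: text first, then URL.
    var format = linkFormat || ' $2 (Link->$1) ';
    return html
        .replace(/<br\s*\/?>/gi, '\n')
        .replace(/<p\b[^>]*>/gi, '\n')
        // $1 is the href value, $2 is the link text
        .replace(/<a\b[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/gi, format)
        // Strip whatever tags are left
        .replace(/<[^>]*>/g, '');
}

var sample = 'See <a href="http://www.bbc.co.uk">BBC</a> today';
console.log(JSON.stringify(htmlToTextKeepLinks(sample)));            // "See  BBC (Link->http://www.bbc.co.uk)  today"
console.log(JSON.stringify(htmlToTextKeepLinks(sample, '$2 [$1]'))); // "See BBC [http://www.bbc.co.uk] today"
```

Avoid angle brackets in the custom format: something like '$2 <$1>' would be removed again by the final tag-stripping step.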

Hope you find this useful.


I built this JavaScript library for a Konfabulator widget that does exactly that. It completely strips out comments and <style> and <script> tags and tries to be somewhat smart about converting <br/>'s and <p/>'s into newlines as well.

http://github.com/mtrimpe/jsHtmlToText


I think the easiest way is to just use Regular Expressions as someone mentioned above. Although there's no reason to use a bunch of them. Try:

stringWithHTML = stringWithHTML.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, "");
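A quick check of what this single expression does and does not remove may help when deciding whether it is enough. The function name and inputs below are made up for illustration; the pattern is the one from the line above.

```javascript
// Wraps the single-regex approach so it can be tried on a few inputs.
function stripElementTags(s) {
    // Matches only things that look like element tags: "<" or "</" followed by a letter.
    return s.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, '');
}

console.log(stripElementTags('1 < 2 and <b>bold</b>')); // 1 < 2 and bold
console.log(stripElementTags('a<!-- note -->b<br/>c')); // a<!-- note -->bc
```

A bare "<" in ordinary text is left alone, which is usually what you want, but HTML comments are left in place too, so they would need a separate step.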

Simplest way:

jQuery('#mydiv').text();

That retrieves all the text inside a div.

Great Answer! you just added 31kb to your project (JQuery) for a simple function. – Dementic Dec 31 '11 at 14:20
We always use jQuery for projects since invariably our projects have a lot of Javascript. Therefore we didn't add bulk, we took advantage of existing API code... – Mark Mar 14 at 16:31
You use it, but the OP might not. the question was about Javascript NOT JQuery. – Dementic Mar 14 at 16:55

I made some modifications to the original Jibberboy2000 script. Hope it'll be useful for someone.

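str = '**ANY HTML CONTENT HERE**';

// Also accept "<br />" with a space before the slash
str = str.replace(/<\s*br\s*\/*>/gi, "\n");
// [^>]* keeps the match inside one opening tag, so two links on a line are handled separately
str = str.replace(/<\s*a\b[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str = str.replace(/<\s*\/*.+?>/ig, "\n");
str = str.replace(/ {2,}/gi, " ");
str = str.replace(/\n+\s*/gi, "\n\n");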

I altered Jibberboy2000's answer to handle more BR tag formats, remove everything inside SCRIPT and STYLE tags, tidy the resulting text by collapsing multiple line breaks and spaces, and convert some HTML-encoded characters back to normal ones. After some testing, it appears that you can convert most full web pages into simple text in which the page title and content are retained.

In the simple example,

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->

<head>

<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>

    body {margin-top: 15px;}
    a { color: #D80C1F; font-weight:bold; text-decoration:none; }

</style>
</head>

<body>
    <center>
        This string has <i>html</i> code i want to <b>remove</b><br>
        In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;
    </center>
</body>
</html>

becomes

This is my title

This string has html code i want to remove

In this line BBC (http://www.bbc.co.uk) with link is mentioned.

Now back to "normal text" and stuff using <html encoding>  

The JavaScript function and test page look like this:

<script>
function convertHtmlToText() {
    var inputText = document.getElementById("input").value;
    var returnText = "" + inputText;

    //-- remove BR tags and replace them with line break
    returnText = returnText.replace(/<br>/gi, "\n");
    returnText = returnText.replace(/<br\s\/>/gi, "\n");
    returnText = returnText.replace(/<br\/>/gi, "\n");

    //-- remove P and A tags but preserve what's inside of them
    //-- ([^>]* keeps each match inside a single opening tag)
    returnText = returnText.replace(/<p\b[^>]*>/gi, "\n");
    returnText = returnText.replace(/<a\b[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/gi, " $2 ($1)");

    //-- remove all inside SCRIPT and STYLE tags
    //-- (lazy match, so text between two separate blocks is kept)
    returnText = returnText.replace(/<script\b[^>]*>[\w\W]*?<\/script>/gi, "");
    returnText = returnText.replace(/<style\b[^>]*>[\w\W]*?<\/style>/gi, "");

    //-- remove all else
    returnText = returnText.replace(/<(?:.|\s)*?>/g, "");

    //-- collapse runs of blank lines into a single blank line:
    returnText = returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");

    //-- collapse runs of spaces into a single space:
    returnText = returnText.replace(/ +(?= )/g, '');

    //-- decode html-encoded characters (&amp; last, to avoid decoding twice):
    returnText = returnText.replace(/&nbsp;/gi, " ");
    returnText = returnText.replace(/&quot;/gi, '"');
    returnText = returnText.replace(/&lt;/gi, '<');
    returnText = returnText.replace(/&gt;/gi, '>');
    returnText = returnText.replace(/&amp;/gi, "&");

    //-- return
    document.getElementById("output").value = returnText;
}
</script>

It was used with this HTML:

<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />
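One detail worth checking in any entity-decoding step like the one in this function is the order of the replacements. The sketch below is a standalone illustration, not part of the answer: decodeBasicEntities is a hypothetical helper that covers only the same five entities and decodes &amp; last.

```javascript
// Hypothetical helper covering only the five entities handled in the answer.
function decodeBasicEntities(text) {
    return text
        .replace(/&nbsp;/gi, ' ')
        .replace(/&quot;/gi, '"')
        .replace(/&lt;/gi, '<')
        .replace(/&gt;/gi, '>')
        // Decode &amp; last so "&amp;lt;" becomes the literal text "&lt;", not "<".
        .replace(/&amp;/gi, '&');
}

console.log(decodeBasicEntities('&quot;normal text&quot; &lt;html encoding&gt;')); // "normal text" <html encoding>
console.log(decodeBasicEntities('write &amp;lt; to get &lt;'));                    // write &lt; to get <
```

If &amp; were decoded first, the second input would come out as "write < to get <", because the freshly produced "&lt;" would be decoded a second time.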
(function($){
    $.html2text = function(html) {
        // Create the scratch element once; its id must match the lookups below
        if ($('#scratch_pad').length === 0) {
            $('<div id="scratch_pad"></div>').appendTo('body');
        }
        return $('#scratch_pad').html(html).text();
    };
})(jQuery);

Define this as a jQuery plugin and use it as follows:

$.html2text(htmlContent);
