Extracting Text Content from HTML (Removing Tags, Styles, etc.) - 天翼云

2022-12-28 07:22:30 Views: 172

The HTML code is as follows:

<h1>登鹳雀楼</h1>
        <div class="poem-detail-header-info">
                                                <a class="poem-detail-header-author" href="https://www.ctyun.cn/portal/link.html?target=%2Fs%3Fwd%3D%E7%8E%8B%E4%B9%8B%E6%B6%A3">
                        <span class="poem-info-gray">【作者】</span>王之涣
                    </a>
                                                        <span class="poem-detail-header-author">
                    <span class="poem-info-gray">【朝代】</span>唐
                </span>
                                    <div class="body-means-change">
                译文对照
            </div>
                    </div>
        <div class="poem-detail-separator"></div>
                                <div class="poem-detail-item-content">
                                <p class="poem-detail-main-text" id="body_p">
                    <span id="body_1_0" data="means_1_0"><em><span class='body-zhushi-span' data='"\u592a\u9633\u3002"'>白日</span><span class='body-zhushi-span' data='"\u4f9d\u508d\u3002"'>依</span>山尽，</em></span><span id="body_1_1" data="means_1_1">黄河入海流。</span>                </p>
                <p id="means_p" class="poem-detail-main-text body-means-p">
                    <span id="means_1_0" data="body_1_0">夕阳依傍着西山慢慢地沉没，</span><span id="means_1_1" data="body_1_1">滔滔黄河朝着东海汹涌奔流。</span>                </p>
                                <p class="poem-detail-main-text" id="body_p">
                    <span id="body_2_0" data="means_2_0"><span class='body-zhushi-span' data='"\u60f3\u8981\u5f97\u5230\u67d0\u79cd\u4e1c\u897f\u6216\u8fbe\u5230\u67d0\u79cd\u76ee\u7684\u7684\u613f\u671b\uff0c\u4f46\u4e5f\u6709\u5e0c\u671b\u3001\u60f3\u8981\u7684\u610f\u601d\u3002"'>欲</span><span class='body-zhushi-span' data='"\u5c3d\uff0c\u4f7f\u8fbe\u5230\u6781\u70b9\u3002"'>穷</span>千里目，</span><span id="body_2_1" data="means_2_1"><span class='body-zhushi-span' data='"\u66ff\u3001\u6362\u3002\uff08\u4e0d\u662f\u901a\u5e38\u7406\u89e3\u7684\u201c\u518d\u201d\u7684\u610f\u601d\uff09"'>更</span>上一层楼。</span>                </p>
                <p id="means_p" class="poem-detail-main-text body-means-p">

Extraction result

[Image: screenshot of the extracted text]

Java code

// Imports required at the top of the enclosing class file
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // regex matching <script> blocks
private static final String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>"; // regex matching <style> blocks
private static final String regEx_html = "<[^>]+>"; // regex matching any HTML tag

/**
 * Removes HTML tags from a string.
 *
 * @param htmlStr the HTML source
 * @return the text with HTML tags removed
 */
public static String formatHTMLTag(String htmlStr) {
    Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
    Matcher m_script = p_script.matcher(htmlStr);
    htmlStr = m_script.replaceAll(""); // strip <script> blocks, including their contents

    Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
    Matcher m_style = p_style.matcher(htmlStr);
    htmlStr = m_style.replaceAll(""); // strip <style> blocks, including their contents

    htmlStr = htmlStr.replaceAll("(?i)<br\\s*\\/?>", "\n"); // turn <br>, <br/> and <br /> into line breaks
    htmlStr = htmlStr.replaceAll("(?i)</p>", "\n"); // turn paragraph ends into line breaks

    Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
    Matcher m_html = p_html.matcher(htmlStr);
    htmlStr = m_html.replaceAll(""); // strip all remaining HTML tags

    htmlStr = htmlStr.replaceAll("&nbsp;", ""); // drop non-breaking space entities
    htmlStr = htmlStr.replaceAll("&amp;", "&"); // decode escaped ampersands

    Pattern p = Pattern.compile("(\r?\n(\\s*\r?\n)+)"); // collapse runs of blank lines into one line break
    Matcher m = p.matcher(htmlStr);
    htmlStr = m.replaceAll("\r\n");

    return htmlStr; // return the plain-text string
}
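
To make the behavior of these steps easier to check, here is a minimal, self-contained sketch that applies the same sequence (remove script and style blocks, turn `<br>` and `</p>` into line breaks, strip the remaining tags, decode `&nbsp;` and `&amp;`, collapse blank lines) to a small sample. The class name `HtmlTextDemo`, the method name `toText`, and the sample HTML are assumptions made for illustration and are not part of the original article. Unlike the method above, this sketch precompiles its patterns, collapses blank lines to a single `\n` rather than `\r\n`, and trims the result.

```java
import java.util.regex.Pattern;

public class HtmlTextDemo {
    // Patterns are compiled once and reused across calls.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern LINE_BREAK =
            Pattern.compile("<br\\s*/?>|</p>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern BLANK_LINES = Pattern.compile("\\r?\\n(\\s*\\r?\\n)+");

    /** Hypothetical helper: reduces an HTML fragment to plain text. */
    public static String toText(String html) {
        String s = SCRIPT.matcher(html).replaceAll("");   // drop scripts with their contents
        s = STYLE.matcher(s).replaceAll("");              // drop styles with their contents
        s = LINE_BREAK.matcher(s).replaceAll("\n");       // <br> and </p> become line breaks
        s = TAG.matcher(s).replaceAll("");                // strip every remaining tag
        s = s.replace("&nbsp;", "").replace("&amp;", "&"); // handle the two entities the article covers
        s = BLANK_LINES.matcher(s).replaceAll("\n");      // collapse runs of blank lines
        return s.trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample, loosely modeled on the markup shown earlier.
        String html = "<style>.poem-info-gray { color: gray; }</style>\n"
                + "<h1>Title</h1>\n"
                + "<p class=\"poem-detail-main-text\"><span id=\"body_1_0\"><em>first line,</em></span><br/>"
                + "<span id=\"body_1_1\">&nbsp;second line.</span></p>\n"
                + "<p>Tom &amp; Jerry</p>";
        System.out.println(toText(html));
        // Prints:
        // Title
        // first line,
        // second line.
        // Tom & Jerry
    }
}
```

A regex-based approach like this is a reasonable fit for simple, well-formed fragments. It only decodes two entities and can misbehave when an attribute value contains `>`, so a real HTML parser is likely the safer choice for arbitrary pages.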
