`

一种解决HTTP抓取网页超时设置无效的方法

阅读更多

今天发现superword在获取单词定义的时候,对于不常见单词,网页打开很慢,超过10秒,经检查,发现是利用Jsoup来抓取单词定义的时候,设置的超时3秒无效,_getContent方法的执行时间超过10秒,代码如下:

 

    public static String getContent(String url) {
        String html = _getContent(url);
        int times = 0;
        while(StringUtils.isNotBlank(html) && html.contains("非常抱歉,来自您ip的请求异常频繁")){
            //使用新的IP地址
            ProxyIp.toNewIp();
            html = _getContent(url);
            if(++times > 2){
                break;
            }
        }
        return html;
    }

    private static String _getContent(String url) {
        Connection conn = Jsoup.connect(url)
                .header("Accept", ACCEPT)
                .header("Accept-Encoding", ENCODING)
                .header("Accept-Language", LANGUAGE)
                .header("Connection", CONNECTION)
                .header("Referer", REFERER)
                .header("Host", HOST)
                .header("User-Agent", USER_AGENT)
                .timeout(3000)
                .ignoreContentType(true);
        String html = "";
        try {
            html = conn.post().html();
            html = html.replaceAll("[\n\r]", "");
        }catch (Exception e){
            LOGGER.error("获取URL:" + url + "页面出错", e);
        }
        return html;
    }

 

所以想了一个办法来解决这个问题,核心思想是主线程启动一个子线程来抓取单词定义,然后主线程休眠指定的超时时间,当超时时间过去后,从子线程获取抓取结果,这个时候如果子线程抓取还未完成,则主线程返回空的单词定义,代码如下:

 

    public static String getContent(String url) {
        long start = System.currentTimeMillis();
        String html = _getContent(url, 1000);
        LOGGER.info("获取拼音耗时: {}", TimeUtils.getTimeDes(System.currentTimeMillis()-start));
        int times = 0;
        while(StringUtils.isNotBlank(html) && html.contains("非常抱歉,来自您ip的请求异常频繁")){
            //使用新的IP地址
            ProxyIp.toNewIp();
            html = _getContent(url);
            if(++times > 2){
                break;
            }
        }
        return html;
    }

    private static String _getContent(String url, int timeout) {
        Future<String> future = ThreadPool.EXECUTOR_SERVICE.submit(()->_getContent(url));
        try {
            Thread.sleep(timeout);
            return future.get(1, TimeUnit.NANOSECONDS);
        } catch (Throwable e) {
            LOGGER.error("获取网页异常", e);
        }
        return "";
    }

    private static String _getContent(String url) {
        Connection conn = Jsoup.connect(url)
                .header("Accept", ACCEPT)
                .header("Accept-Encoding", ENCODING)
                .header("Accept-Language", LANGUAGE)
                .header("Connection", CONNECTION)
                .header("Referer", REFERER)
                .header("Host", HOST)
                .header("User-Agent", USER_AGENT)
                .timeout(1000)
                .ignoreContentType(true);
        String html = "";
        try {
            html = conn.post().html();
            html = html.replaceAll("[\n\r]", "");
        }catch (Exception e){
            LOGGER.error("获取URL:" + url + "页面出错", e);
        }
        return html;
    }

 

 

详细代码地址:

https://github.com/ysc/superword/commit/e4bc3c4197af95a8d7519856c89d592515a1c18f

 

 

 

 

 

 

1
1
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics