基于Delphi的Web文本获取方法
刘建培
【期刊名称】《计算机时代》 【年(卷),期】2016(000)003
【摘要】提出基于delphi的Web文本获取方法,从网页中获取Web页面格式的源文件(.html文件),分析它的结构信息,处理它的控制符,通过分析过滤源文件的格式来提取网页中的文本信息。利用标点符号对文本信息进行章节、段落、句子等预处理,将文本信息转换成句子序列,让用户快速地定位到需要了解的内容,从而让用户远离钓鱼网站、恶意广告、欺诈信息以及在浏览网页内容时产生的骚扰,提高互联网体验。%In this paper, a method of Web text acquisition with Delphi is proposed, which obtains the source files of the Web page format (.Html file) from the Web page, analyzes its structure information, deals with its control character, and extracts the text information from the Web page by analyzing and filtering the source files’ formats. The method makes use of punctuation marks to preprocess the text information for sections, paragraphs and sentences, converts the text information into sentence sequences, which allows the users to quickly navigate to the contents needed to know, allows the users to stay away from phishing sites, malicious advertising, fraud information and the harassment generated by browsing the content of Web pages, and improves their Internet experience. 【总页数】3页(50-52)