Text segmentation for MRC document compression.

Author information

School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-2035, USA.

Publication information

IEEE Trans Image Process. 2011 Jun;20(6):1611-26. doi: 10.1109/TIP.2010.2101611. Epub 2010 Dec 23.

DOI:10.1109/TIP.2010.2101611

Abstract

The mixed raster content (MRC) standard (ITU-T T.44) specifies a framework for document compression which can dramatically improve the compression/quality tradeoff as compared to traditional lossy image compression algorithms. The key to MRC compression is the separation of the document into foreground and background layers, represented as a binary mask. Therefore, the resulting quality and compression ratio of an MRC document encoder is highly dependent upon the segmentation algorithm used to compute the binary mask. In this paper, we propose a novel multiscale segmentation scheme for MRC document encoding based upon the sequential application of two algorithms. The first algorithm, cost optimized segmentation (COS), is a blockwise segmentation algorithm formulated in a global cost optimization framework. The second algorithm, connected component classification (CCC), refines the initial segmentation by classifying feature vectors of connected components using a Markov random field (MRF) model. The combined COS/CCC segmentation algorithms are then incorporated into a multiscale framework in order to improve the segmentation accuracy of text with varying size. In comparisons with state-of-the-art commercial MRC products and selected segmentation algorithms in the literature, we show that the new algorithm achieves greater accuracy of text detection together with a lower false detection rate of nontext features. We also demonstrate that the proposed segmentation algorithm can improve the quality of decoded documents while simultaneously lowering the bit rate.
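As a rough illustration of the layer model described in the abstract, the sketch below splits a tiny grayscale page into foreground and background layers using a binary mask and then recomposes it. It is not the paper's COS/CCC method: the fixed global threshold in `segment` is only a stand-in for a real segmentation algorithm, and the names `segment`, `split_layers`, `compose`, and the `fill` value are hypothetical, chosen for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of 0-255 values


def segment(image: Image, threshold: int = 128) -> Image:
    """Toy stand-in segmentation: dark pixels (assumed text) -> 1, others -> 0."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def split_layers(image: Image, mask: Image, fill: int = 0) -> Tuple[Image, Image]:
    """Split an image into foreground and background layers using a binary mask.

    Pixels that do not belong to a layer are set to `fill` (an arbitrary
    placeholder here; the decoder never reads them).
    """
    fg = [[px if m else fill for px, m in zip(row, mrow)]
          for row, mrow in zip(image, mask)]
    bg = [[fill if m else px for px, m in zip(row, mrow)]
          for row, mrow in zip(image, mask)]
    return fg, bg


def compose(fg: Image, bg: Image, mask: Image) -> Image:
    """Rebuild the page: take the foreground where mask is 1, else the background."""
    return [[f if m else b for f, b, m in zip(frow, brow, mrow)]
            for frow, brow, mrow in zip(fg, bg, mask)]


# A 2x3 "page": bright paper with a few dark text pixels.
page = [[250, 20, 245],
        [30, 240, 25]]

mask = segment(page)
foreground, background = split_layers(page, mask)
restored = compose(foreground, background, mask)
```

Because no lossy coding is applied to the layers here, `restored` equals `page`. In an actual encoder each layer would be compressed separately, which is why an inaccurate mask affects both the decoded quality and the bit rate.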

Similar articles

Script-independent text line segmentation in freestyle handwritten documents.

IEEE Trans Pattern Anal Mach Intell. 2008 Aug;30(8):1313-29. doi: 10.1109/TPAMI.2007.70792.

Machine printed text and handwriting identification in noisy document images.

IEEE Trans Pattern Anal Mach Intell. 2004 Mar;26(3):337-53. doi: 10.1109/TPAMI.2004.1262324.

Signature detection and matching for document image retrieval.

IEEE Trans Pattern Anal Mach Intell. 2009 Nov;31(11):2015-31. doi: 10.1109/TPAMI.2008.237.

A novel document ranking method using the discrete cosine transform.

IEEE Trans Pattern Anal Mach Intell. 2005 Jan;27(1):130-5. doi: 10.1109/TPAMI.2005.2.

A parallel-line detection algorithm based on HMM decoding.

IEEE Trans Pattern Anal Mach Intell. 2005 May;27(5):777-92. doi: 10.1109/TPAMI.2005.89.

Performance evaluation and benchmarking of six-page segmentation algorithms.

IEEE Trans Pattern Anal Mach Intell. 2008 Jun;30(6):941-54. doi: 10.1109/TPAMI.2007.70837.

High-quality MRC document coding.

IEEE Trans Image Process. 2006 Oct;15(10):3152-69. doi: 10.1109/tip.2006.877493.

Globally consistent reconstruction of ripped-up documents.

IEEE Trans Pattern Anal Mach Intell. 2008 Jan;30(1):1-13. doi: 10.1109/TPAMI.2007.1163.

Texture for script identification.

IEEE Trans Pattern Anal Mach Intell. 2005 Nov;27(11):1720-32. doi: 10.1109/TPAMI.2005.227.

PMID:21189243