Parsing Word and Excel with POI, and locating the text, pictures, audio and video inside them


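1 Purpose

I have recently been working on a project that needs to extract certain properties from Word and Excel files. At first I did not take it seriously and assumed it would be simple, but partway through I realised it was not. I could find almost nobody describing this kind of requirement, and it took me a lot of time to find a way. I am sharing it so that you can avoid the detours. I am a beginner myself, so please be kind.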

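2 Technology choice

POI, of course: it is free and well documented. Below are the Maven dependencies I used. Just take the latest version, but keep every POI artifact on the same version, because mixing POI jars from different versions is not supported.

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>5.2.2</version>
        </dependency>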

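3 Implementation

It really is a hassle: the old formats (.xls, .doc) and the new ones (.xlsx, .docx) are parsed differently.

3.1 xlsx

Why start with xlsx? Because it is the simplest. Text extraction comes first; there is not much to say, so straight to the code: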

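import java.io.FileInputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

// DataFormatter renders each cell as the text Excel would display,
// so numbers, dates and special symbols are not lost
DataFormatter formatter = new DataFormatter();
try (XSSFWorkbook workbook = new XSSFWorkbook(new FileInputStream(file)))
{
    // Number of sheets
    int numberOfSheets = workbook.getNumberOfSheets();
    for (int i = 0; i < numberOfSheets; i++)
    {
        // Read each sheet
        XSSFSheet sheet = workbook.getSheetAt(i);
        // Iterate over each row
        for (Row row : sheet)
        {
            // Iterate over each cell and merge the values of one row
            StringBuilder s = new StringBuilder();
            for (Cell cell : row)
            {
                s.append(formatter.formatCellValue(cell));
            }
            // s now holds the text of row row.getRowNum() on sheet i
        }
    }
}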

Nothing much to explain: i is the sheet index and row is each row.

Next come pictures, which are also fairly simple:

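// Iterate over the shapes to find pictures and embedded objects.
// getDrawingPatriarch() returns null when the sheet has no drawing;
// unlike createDrawingPatriarch() it does not modify the workbook.
XSSFDrawing drawing = sheet.getDrawingPatriarch();
List<XSSFShape> shapes = drawing == null ? Collections.emptyList() : drawing.getShapes();
for (XSSFShape shape : shapes)
{
    // Pictures
    if (shape instanceof XSSFPicture)
    {
        XSSFPicture picture = (XSSFPicture) shape;
        // Position information
        XSSFClientAnchor anchor = picture.getClientAnchor();
        // Row in which the picture starts
        int row1 = anchor.getRow1();
        // Picture bytes
        byte[] data = picture.getPictureData().getData();
        // File type, taken from the content type of the picture part
        String type = picture.getPictureData().getPackagePart().getContentTypeDetails().getSubType();
    }
}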

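A picture is just one kind of shape, so we iterate over all shapes and read the position and data from each.

Next come video and audio, and this is where it gets weird: what you extract is not an .mp4 or .mp3 file but a .bin file. Rename the .xlsx to .zip and look in xl/embeddings to see it. As I understand it, the .bin is an OLE2 file: when you embed a video in Excel, Excel wraps it in another container, so the usual approach cannot read it.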
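To check what a workbook actually carries without renaming it by hand, the package can be read as a plain zip with the JDK alone. The following is a minimal sketch, not part of the original code: the class name EmbeddingLister and the method listEmbeddings are made up for this illustration, and it only lists entry names under xl/embeddings/.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative helper (hypothetical name): lists embedded-object parts of an .xlsx package. */
final class EmbeddingLister {

    // Folder inside the package where Excel stores embedded objects
    private static final String EMBEDDINGS_DIR = "xl/embeddings/";

    /** Returns the names of all file entries under xl/embeddings/ in the given package stream. */
    static List<String> listEmbeddings(InputStream xlsx) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            // An .xlsx file is a zip archive, so its parts can be walked entry by entry
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!entry.isDirectory() && name.startsWith(EMBEDDINGS_DIR)) {
                    names.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

Calling EmbeddingLister.listEmbeddings(new FileInputStream(file)) on a workbook with an embedded video should return the .bin entries described above; it says nothing about what is inside them, which is what the OLE2 parsing further down is for.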

First we get hold of the .bin file:

// Embedded objects
if (shape instanceof XSSFObjectData)
{
    XSSFObjectData objectData = (XSSFObjectData) shape;
    if (objectData.getFileName().endsWith(".bin"))
    {
        // The .bin part (an OLE2 container)
        InputStream embeddedStream = objectData.getObjectPart().getInputStream();
    }
}

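embeddedStream is the .bin file itself, but it cannot be used directly and needs further parsing. Following the official documentation, it can be opened with POIFSFileSystem. I wrote a utility method for this: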

https://poi.apache.org/components/poifs/fileformat.html

public static OfficeEmbed getEmbedInfo(InputStream i)
{
    try (POIFSFileSystem fs = new POIFSFileSystem(i))
    {
        Ole10Native ole10 = Ole10Native.createFromEmbeddedOleObject(fs.getRoot());
        // File name of the embedded file
        String fileName = ole10.getLabel();
        // Suffix; empty when the name contains no dot
        int dot = fileName.lastIndexOf('.');
        String suffix = dot >= 0 ? fileName.substring(dot + 1) : "";
        // Raw bytes of the embedded file
        byte[] b = ole10.getDataBuffer();
        if (StringUtils.isNotBlank(suffix))
        {
            return new OfficeEmbed(getFileType(suffix.toLowerCase()), suffix, b);
        }
    }
    catch (Ole10NativeException | IOException e)
    {
        e.printStackTrace();
    }
    // Not an OLE 1.0 native object, or no usable suffix
    return null;
}
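OfficeEmbed and getFileType are the author's own helpers and are not shown in the post. To make the utility above easier to follow, here is a rough sketch of what they might look like. Everything in it is an assumption: the field names are inferred from the getters used in the post (getType, getSuffix, getB), the FileTypes class and its extension lists are invented for illustration, and "other" is assumed to be the value of the OTHER constant.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the post's OfficeEmbed holder. */
final class OfficeEmbed {
    private final String type;   // coarse category: image, audio, video or other
    private final String suffix; // file extension without the dot
    private final byte[] b;      // raw bytes of the embedded file

    OfficeEmbed(String type, String suffix, byte[] b) {
        this.type = type;
        this.suffix = suffix;
        this.b = b;
    }

    String getType() { return type; }
    String getSuffix() { return suffix; }
    byte[] getB() { return b; }
}

/** Hypothetical suffix classifier; the extension lists are illustrative, not exhaustive. */
final class FileTypes {
    static final String IMAGE = "image";
    static final String AUDIO = "audio";
    static final String VIDEO = "video";
    static final String OTHER = "other";

    private static final Set<String> IMAGES = Set.of("png", "jpg", "jpeg", "gif", "bmp");
    private static final Set<String> AUDIOS = Set.of("mp3", "wav", "flac", "aac");
    private static final Set<String> VIDEOS = Set.of("mp4", "avi", "mov", "mkv", "wmv");

    /** Maps a file suffix to a coarse category, falling back to OTHER. */
    static String getFileType(String suffix) {
        if (suffix == null) {
            return OTHER;
        }
        String s = suffix.toLowerCase(Locale.ROOT);
        if (IMAGES.contains(s)) {
            return IMAGE;
        }
        if (AUDIOS.contains(s)) {
            return AUDIO;
        }
        if (VIDEOS.contains(s)) {
            return VIDEO;
        }
        return OTHER;
    }
}
```

With helpers shaped like this, the !OTHER.equals(type) checks used in the snippets simply filter out embedded files that are neither pictures, audio nor video.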

Moving on. The data can now be extracted, but we still have no position. See below:

// The parsed object; OfficeEmbed is a holder class I wrote myself
OfficeEmbed officeEmbed = OfficeUtils.getEmbedInfo(embeddedStream);
if (officeEmbed != null)
{
    String type = officeEmbed.getType();
    if (!OTHER.equals(type))
    {
        // Position information
        ChildAnchor chAnc = shape.getAnchor();
        if (chAnc instanceof ClientAnchor)
        {
            // Row in which the object starts
            ClientAnchor anc = (ClientAnchor) chAnc;
            int row1 = anc.getRow1();
            // Bytes of the embedded file
            byte[] b = officeEmbed.getB();
            // Suffix of the embedded file
            String suffix = officeEmbed.getSuffix();
        }
    }
}

3.2 xls

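Parsing xls is much the same as xlsx, but embedded objects are handled a little differently: the object is already an OLE2 directory, so it is passed on as a DirectoryNode instead of a stream. (That overload of getEmbedInfo is not shown here; Ole10Native.createFromEmbeddedOleObject also accepts a DirectoryNode.)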

if (shape instanceof HSSFObjectData)
{
    HSSFObjectData objectData = (HSSFObjectData) shape;
    // Only objects stored as an OLE2 directory can be unpacked this way
    if (objectData.hasDirectoryEntry())
    {
        DirectoryNode dn = (DirectoryNode) objectData.getDirectory();
        OfficeEmbed officeEmbed = OfficeUtils.getEmbedInfo(dn);
    }
}

3.3 docx

For Word, I read the document paragraph by paragraph:

try (XWPFDocument document = new XWPFDocument(new FileInputStream(file)))
{
    for (XWPFParagraph para : document.getParagraphs())
    {
        // Paragraph text
        String text = para.getText();
        // Pictures and embedded files
        for (XWPFRun run : para.getRuns())
        {
            // All pictures embedded in this run
            List<XWPFPicture> xwpfPictures = run.getEmbeddedPictures();
            for (XWPFPicture item : xwpfPictures)
            {
                // Bytes
                byte[] b = item.getPictureData().getData();
                // Content type, e.g. image/png
                String type = item.getPictureData().getPackagePart().getContentType();
                // Suffix
                String suffix = type.substring(type.lastIndexOf('/') + 1);
            }
            // Embedded files
            List<CTObject> c = run.getCTR().getObjectList();
            for (CTObject item : c)
            {
                NodeList nn = item.getDomNode().getChildNodes();
                for (int j = 0; j < nn.getLength(); j++)
                {
                    Node node = nn.item(j);
                    if (node != null)
                    {
                        String s = node.getNodeName();
                        if ("o:OLEObject".equals(s))
                        {
                            // The r:id attribute names the relationship to the embedded part
                            NamedNodeMap namedNodeMap = node.getAttributes();
                            String rId = namedNodeMap.getNamedItem("r:id").getNodeValue();
                            PackagePart packagePart = document.getPartById(rId);
                            OfficeEmbed officeEmbed = OfficeUtils.getEmbedInfo(packagePart.getInputStream());
                            if (officeEmbed != null)
                            {
                                String type = officeEmbed.getType();
                                if (!OTHER.equals(type))
                                {
                                    // Bytes
                                    byte[] b = officeEmbed.getB();
                                    // Suffix
                                    String suffix = officeEmbed.getSuffix();
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

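In essence, the rId is used to look up the relationship in the package XML, and the relationship points to the embedded file.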
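To make the rId lookup less mysterious: in a .docx, the relationships of the main document live in word/_rels/document.xml.rels, and each Relationship element maps an Id to a Target part. The sketch below parses such a relationships part with the JDK's DOM parser only. The class name RelationshipIndex is invented for this illustration; POI's getPartById does the equivalent work for you, so this is only meant to show what happens underneath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative helper (hypothetical name): maps relationship ids to their targets. */
final class RelationshipIndex {

    // Namespace of OPC relationship parts (*.rels)
    private static final String REL_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";

    /** Parses the XML of a .rels part and returns Id -> Target. */
    static Map<String, String> parse(String relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));
            NodeList rels = doc.getElementsByTagNameNS(REL_NS, "Relationship");
            Map<String, String> targets = new HashMap<>();
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a valid relationships part", e);
        }
    }
}
```

Given the rId read from the o:OLEObject node, the map returns a target such as embeddings/oleObject1.bin, which is resolved relative to the folder of the part that owns the relationships (word/ for the main document).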

Text boxes can be read like this, using a method I copied from a foreign site:

// Text boxes: match both DrawingML (wps:txbx) and VML (v:textbox) content
String rtx = "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
        + "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' "
        + "declare namespace v='urn:schemas-microsoft-com:vml' "
        + ".//*/wps:txbx/w:txbxContent | .//*/v:textbox/w:txbxContent";
XmlObject[] textBoxObjects = para.getCTP().selectPath(rtx);
// Step by 2: this assumes each text box is stored twice (wps:txbx plus a
// v:textbox fallback), so every box matches twice
for (int j = 0; j < textBoxObjects.length; j += 2)
{
    XmlObject[] paraObjects = textBoxObjects[j].
            selectChildren(
                    new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p"));
    StringBuilder stringBuilder = new StringBuilder();
    for (int z = 0; z < paraObjects.length; z++)
    {
        // Rebuild each w:p as a paragraph so that getText() can be used
        XWPFParagraph embeddedPara = new XWPFParagraph(
                CTP.Factory.parse(paraObjects[z].xmlText()), para.getBody());
        stringBuilder.append(embeddedPara.getText());
    }
    String textBox = stringBuilder.toString();
}

Then tables:

List<IBodyElement> list = document.getBodyElements();
for (IBodyElement bodyElement : list)
{
    if (bodyElement instanceof XWPFTable)
    {
        XWPFTable xwpfTable = (XWPFTable) bodyElement;
        // Position of the table among the body elements
        int zxz = document.getPosOfTable(xwpfTable);
        String result = xwpfTable.getText();
    }
}

3.4 doc

Now for the nastiest one:

try (HWPFDocument document = new HWPFDocument(new FileInputStream(file)))
{
    PicturesTable picturesTable = document.getPicturesTable();
    // Walk the main text
    Range range = document.getRange();
    for (int i = 0; i < range.numParagraphs(); i++)
    {
        Paragraph paragraph = range.getParagraph(i);
        // Pictures and embedded objects
        for (int j = 0; j < paragraph.numCharacterRuns(); j++)
        {
            CharacterRun run = paragraph.getCharacterRun(j);
            // Check whether the run holds an embedded object
            int picOffset = run.getPicOffset();
            if (run.isOle2())
            {
                OfficeEmbed officeEmbed = this.getOle2Result(document, picOffset);
                if (officeEmbed != null)
                {
                    // Bytes
                    byte[] b = officeEmbed.getB();
                    // Suffix
                    String suffix = officeEmbed.getSuffix();
                }
            }
            // Pictures
            if (picOffset >= 0)
            {
                Picture picture = picturesTable.extractPicture(run, true);
                if (picture != null)
                {
                    byte[] b = picture.getContent();
                    String type = picture.getMimeType();
                    // Suffix
                    String suffix = type.substring(type.lastIndexOf('/') + 1);
                    if (!"emf".equals(suffix) && !"x-emf".equals(suffix))
                    {
                        // Use the picture here; EMF images are skipped
                    }
                }
            }
        }
        // Paragraph text
        String paragraphText = paragraph.text();
        // Text box anchored at the end of this paragraph, if any
        // (boxMap is built in the snippet further below)
        int endOffset = paragraph.getEndOffset();
        String boxText = boxMap.get(endOffset + "");
    }
}
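The picture branch above derives the suffix by cutting the MIME type at the last slash, which is why it has to compare against both "emf" and "x-emf". A small normalising helper can take care of that in one place. This is only a sketch: MimeSuffix and fromMimeType are names made up here, and dropping the "x-" prefix is a convenience choice of this example, not something POI does.

```java
import java.util.Locale;

/** Illustrative helper (hypothetical name): turns a MIME type into a file suffix. */
final class MimeSuffix {

    /**
     * Returns the subtype of a MIME type in lower case, without parameters
     * and without a leading "x-"; returns "" for null input.
     */
    static String fromMimeType(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        String s = mimeType.trim().toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=..."
        int semicolon = s.indexOf(';');
        if (semicolon >= 0) {
            s = s.substring(0, semicolon).trim();
        }
        // Keep only the part after the last slash
        int slash = s.lastIndexOf('/');
        String subtype = slash >= 0 ? s.substring(slash + 1) : s;
        // "x-emf" and "emf" should compare equal
        return subtype.startsWith("x-") ? subtype.substring(2) : subtype;
    }
}
```

With this, the EMF check shrinks to a single comparison against "emf". Note that the result is still the MIME subtype, so image/jpeg gives "jpeg" rather than "jpg".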

Text boxes are a special case. Their positions are stored in _fspaMain, and I could not find a way to read it directly, so I used reflection:

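    /**
     * Reads _fspaMain via reflection and extracts the paragraph positions stored in it.
     * @param document the .doc file being read
     * @return list of positions, one per text box
     */
    private List<String> getTextBoxPosition(HWPFDocument document) throws NoSuchFieldException, IllegalAccessException
    {
        List<String> strings = new ArrayList<>();
        // _fspaMain is a private field of HWPFDocument
        java.lang.reflect.Field fspaField = HWPFDocument.class.getDeclaredField("_fspaMain");
        fspaField.setAccessible(true);
        FSPATable fspaMain = (FSPATable) fspaField.get(document);
        // The positions are picked out of the table's string form;
        // FSPA_PATTERN is a regex constant defined elsewhere in the class (not shown)
        String s = fspaMain.toString();
        Matcher matcher = FPSA_PATTERN.matcher(s);
        while (matcher.find())
        {
            strings.add(matcher.group());
        }
        return strings;
    }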
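The reflection trick itself is independent of POI. As a minimal, self-contained sketch of the same technique, the helper below reads a private field from any object; PrivateFields and Holder are names invented for this example, with Holder standing in for a library class such as HWPFDocument.

```java
import java.lang.reflect.Field;

/** Illustrative helper (hypothetical name): reads private fields via reflection. */
final class PrivateFields {

    /** Reads the field {@code fieldName}, declared on {@code owner}, from {@code target}. */
    static Object read(Class<?> owner, Object target, String fieldName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            // Lift the private access check for this Field object only
            field.setAccessible(true);
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Cannot read field " + fieldName + " of " + owner.getName(), e);
        }
    }
}

/** Toy class standing in for a library type with a private field. */
final class Holder {
    private final String hidden;

    Holder(String hidden) {
        this.hidden = hidden;
    }
}
```

Applied to the case above, the call would be PrivateFields.read(HWPFDocument.class, document, "_fspaMain") followed by a cast to FSPATable. Keep in mind that this relies on a private field name inside POI, so it may stop working after a library upgrade.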

Then:

// Positions read from _fspaMain via reflection
List<String> textBoxPosition = getTextBoxPosition(document);
Map<String, String> boxMap = new HashMap<>();
// Collect the text of each text box
if (CollectionUtil.isNotEmpty(textBoxPosition))
{
    Range range = document.getMainTextboxRange();
    StringBuilder stringBuffer = new StringBuilder();
    int sum = 0;
    for (int i = 0; i < range.numParagraphs(); i++)
    {
        Paragraph paragraph = range.getParagraph(i);
        String text = paragraph.text();
        boolean e = paragraph.isWidowControlled();
        if (e)
        {
            // Still inside the current text box: keep accumulating
            stringBuffer.append(text);
        }
        else
        {
            if (textBoxPosition.size() > sum)
            {
                // Last paragraph of the current text box: store it under its position
                stringBuffer.append(text);
                boxMap.put(textBoxPosition.get(sum), stringBuffer.toString());
                stringBuffer.setLength(0);
                sum++;
            }
        }
    }
}

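Embedded files in .doc are different again: they are named after an offset, and without reading POI's internals you would never find the method: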

    /**
     * Looks up an embedded file by its object id.
     * @param doc the .doc file
     * @param objId object id (the picture offset of the character run)
     * @return the embedded object, or null if it is missing or of an unwanted type
     */
    private OfficeEmbed getOle2Result(HWPFDocument doc, int objId)
    {
        // Entries in the object pool are named "_" followed by the id
        Entry entry = doc.getObjectsPool().getObjectById("_" + objId);
        if (entry == null)
        {
            log.info("Referenced OLE2 object '{}' not found in ObjectPool", objId);
            return null;
        }
        if (entry.isDirectoryEntry())
        {
            DirectoryNode dn = (DirectoryNode) entry;
            OfficeEmbed officeEmbed = OfficeUtils.getEmbedInfo(dn);
            if (officeEmbed != null)
            {
                String type = officeEmbed.getType();
                if (!OTHER.equals(type))
                {
                    return officeEmbed;
                }
            }
        }
        return null;
    }

That's all. Thanks for reading.
