MATLAB Text Analytics Toolbox™ User's Guide [R2020a ed.]

Table of contents :
Text Data Preparation
Extract Text Data from Files
Prepare Text Data for Analysis
Parse HTML and Extract Text Content
Correct Spelling in Documents
Create Extension Dictionary for Spelling Correction
Create Custom Spelling Correction Function Using Edit Distance Searchers
Data Sets for Text Analytics
Modeling and Prediction
Create Simple Text Model for Classification
Analyze Text Data Using Multiword Phrases
Analyze Text Data Using Topic Models
Choose Number of Topics for LDA Model
Compare LDA Solvers
Create Co-occurrence Network
Analyze Text Data Containing Emojis
Create Simple Preprocessing Function
Train a Sentiment Classifier
Classify Text Data Using Deep Learning
Classify Text Data Using Convolutional Neural Network
Multilabel Text Classification Using Deep Learning
Sequence-to-Sequence Translation Using Attention
Classify Out-of-Memory Text Data Using Deep Learning
Pride and Prejudice and MATLAB
Word-By-Word Text Generation Using Deep Learning
Classify Out-of-Memory Text Data Using Custom Mini-Batch Datastore
Display and Presentation
Visualize Text Data Using Word Clouds
Visualize Word Embeddings Using Text Scatter Plots
Language Support
Language Considerations
Language-Independent Features
Japanese Language Support
Tokenization
Part of Speech Details
Named Entity Recognition
Stop Words
Lemmatization
Language-Independent Features
Analyze Japanese Text Data
German Language Support
Tokenization
Sentence Detection
Part of Speech Details
Named Entity Recognition
Stop Words
Stemming
Language-Independent Features
Analyze German Text Data
Korean Language Support
Tokenization
Part of Speech Details
Named Entity Recognition
Stop Words
Lemmatization
Language-Independent Features
Language-Independent Features
Word and N-Gram Counting
Modeling and Prediction
Glossary
Text Analytics Glossary
Documents and Tokens
Preprocessing
Modeling and Prediction
Visualization


How to Contact MathWorks

Latest news: www.mathworks.com
Sales and services: www.mathworks.com/sales_and_services
User community: www.mathworks.com/matlabcentral
Technical support: www.mathworks.com/support/contact_us
Phone: 508-647-7000

The MathWorks, Inc.
1 Apple Hill Drive
Natick, MA 01760-2098

Text Analytics Toolbox™ User's Guide
© COPYRIGHT 2017–2020 by The MathWorks, Inc.

The software described in this document is furnished under a license agreement. The software may be used or copied only under the terms of the license agreement. No part of this manual may be photocopied or reproduced in any form without prior written consent from The MathWorks, Inc.

FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation by, for, or through the federal government of the United States. By accepting delivery of the Program or Documentation, the government hereby agrees that this software or documentation qualifies as commercial computer software or commercial computer software documentation as such terms are used or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms and conditions of this Agreement and only those rights specified in this Agreement, shall pertain to and govern the use, modification, reproduction, release, performance, display, and disclosure of the Program and Documentation by the federal government (or other entity acquiring for or through the federal government) and shall supersede any conflicting contractual terms or conditions. If this License fails to meet the government's needs or is inconsistent in any respect with federal procurement law, the government agrees to return the Program and Documentation, unused, to The MathWorks, Inc.

Trademarks

MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.

Patents

MathWorks products are protected by one or more U.S. patents. Please see www.mathworks.com/patents for more information.

Revision History

March 2018       Online Only   New for Version 1.1 (Release 2018a)
September 2018   Online Only   Revised for Version 1.2 (Release 2018b)
March 2019       Online Only   Revised for Version 1.3 (Release 2019a)
September 2019   Online Only   Revised for Version 1.4 (Release 2019b)
March 2020       Online Only   Revised for Version 1.5 (Release 2020a)


1 Text Data Preparation

• “Extract Text Data from Files”
• “Prepare Text Data for Analysis”
• “Parse HTML and Extract Text Content”
• “Correct Spelling in Documents”
• “Create Extension Dictionary for Spelling Correction”
• “Create Custom Spelling Correction Function Using Edit Distance Searchers”
• “Data Sets for Text Analytics”


Extract Text Data from Files

This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts the text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.

Text File

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

    filename = "sonnets.txt";
    str = extractFileText(filename);

View the first sonnet by extracting the text between the two titles "I" and "II".

    start = " I" + newline;
    fin = " II";
    sonnet1 = extractBetween(str,start,fin)

    sonnet1 =
        "
           From fairest creatures we desire increase,
           That thereby beauty's rose might never die,
           But as the riper should by time decease,
           His tender heir might bear his memory:
           But thou, contracted to thine own bright eyes,
           Feed'st thy light's flame with self-substantial fuel,
           Making a famine where abundance lies,
           Thy self thy foe, to thy sweet self too cruel:
           Thou that art now the world's fresh ornament,
           And only herald to the gaudy spring,
           Within thine own bud buriest thy content,
           And tender churl mak'st waste in niggarding:
           Pity the world, or else this glutton be,
           To eat the world's due, by the grave and thee.
         "
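The start marker above ends with newline because " I" on its own is also a prefix of the titles " II" and " III". The following is a minimal sketch of that effect on made-up text (the poem lines are placeholders, not the contents of sonnets.txt). It uses only base MATLAB string functions, so it runs without any file on disk.

```matlab
% Toy text with Roman-numeral titles on their own lines (placeholder content).
str = " I" + newline + "First poem." + newline + ...
      " II" + newline + "Second poem." + newline + ...
      " III" + newline + "Third poem.";

% Appending newline pins the start marker to the title line for "I".
start = " I" + newline;
fin = " II";
poem1 = strip(extractBetween(str,start,fin));

% Compare how many places each candidate start marker matches.
nLoose = count(str," I");    % also matches inside " II" and " III"
nTight = count(str,start);   % matches only the " I" title line
```

Here the loose marker matches three times and the tight marker once, which is why the examples in this section build their markers from the title plus the surrounding whitespace.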

Microsoft Word Document

Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.

    filename = "exampleSonnets.docx";
    str = extractFileText(filename);

View the second sonnet by extracting the text between the two titles "II" and "III".

    start = " II" + newline;
    fin = " III";
    sonnet2 = extractBetween(str,start,fin)


    sonnet2 =
        "
           When forty winters shall besiege thy brow,

           And dig deep trenches in thy beauty's field,

           Thy youth's proud livery so gazed on now,

           Will be a tatter'd weed of small worth held:

           Then being asked, where all thy beauty lies,

           Where all the treasure of thy lusty days;

           To say, within thine own deep sunken eyes,

           Were an all-eating shame, and thriftless praise.

           How much more praise deserv'd thy beauty's use,

           If thou couldst answer 'This fair child of mine

           Shall sum my count, and make my old excuse,'

           Proving his beauty by succession thine!

           This were to be new made when thou art old,

           And see thy blood warm when thou feel'st it cold.

         "

The example Microsoft Word document uses two newline characters between each line. To replace these characters with a single newline character, use the replace function.

    sonnet2 = replace(sonnet2,[newline newline],newline)

    sonnet2 =
        "
           When forty winters shall besiege thy brow,
           And dig deep trenches in thy beauty's field,
           Thy youth's proud livery so gazed on now,
           Will be a tatter'd weed of small worth held:
           Then being asked, where all thy beauty lies,
           Where all the treasure of thy lusty days;
           To say, within thine own deep sunken eyes,
           Were an all-eating shame, and thriftless praise.
           How much more praise deserv'd thy beauty's use,
           If thou couldst answer 'This fair child of mine
           Shall sum my count, and make my old excuse,'
           Proving his beauty by succession thine!
           This were to be new made when thou art old,
           And see thy blood warm when thou feel'st it cold.
         "
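A single call to replace works here because the document uses exactly two newline characters between lines. As a hedged sketch, the snippet below shows what happens on made-up text that contains a run of three newlines, and one possible alternative using regexprep from base MATLAB. The regexprep variant is an assumption about what you might want for messier input and is not part of the original example.

```matlab
% Placeholder text: one gap of two newlines and one gap of three.
nl = string(newline);
txt = "line one" + nl + nl + "line two" + nl + nl + nl + "line three";

% One pass of replace turns each non-overlapping pair into one newline,
% so the run of three newlines still leaves a blank line behind.
onePass = replace(txt,[newline newline],newline);

% A regular expression collapses any run of newlines to a single one.
collapsed = regexprep(txt,"\n+",nl);

nOnePass = count(onePass,nl);      % newlines left after replace
nCollapsed = count(collapsed,nl);  % newlines left after regexprep
```

For input that is known to be uniformly double-spaced, replace stays the simpler and more readable choice.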


PDF Files

Extract text from PDF documents and data from PDF forms.

PDF Document

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.

    filename = "exampleSonnets.pdf";
    str = extractFileText(filename);

View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.

    start = " III " + newline;
    fin = "IV";
    sonnet3 = extractBetween(str,start,fin)

    sonnet3 =
        "
           Look in thy glass and tell the face thou viewest
           Now is the time that face should form another;
           Whose fresh repair if now thou not renewest,
           Thou dost beguile the world, unbless some mother.
           For where is she so fair whose unear'd womb
           Disdains the tillage of thy husbandry?
           Or who is he so fond will be the tomb,
           Of his self-love to stop posterity?
           Thou art thy mother's glass and she in thee
           Calls back the lovely April of her prime;
           So thou through windows of thine age shalt see,
           Despite of wrinkles this thy golden time.
           But if thou live, remember'd not to be,
           Die single and thine image dies with thee.
         "
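Because this PDF places a space before each newline, the start marker above includes that trailing space. An alternative, sketched below on placeholder text, is to strip trailing spaces and tabs from every line first so that the same markers work regardless of file type. This normalization step is an assumption for illustration and is not part of the original example.

```matlab
% Placeholder text mimicking PDF output: a space before every newline.
nl = string(newline);
raw = " III " + nl + "First line of the poem " + nl + ...
      "Second line of the poem " + nl + " IV " + nl;

% Remove spaces and tabs that sit directly before a newline.
clean = regexprep(raw,"[ \t]+\n",nl);

% After cleaning, the marker no longer needs the trailing space.
poem = extractBetween(clean," III" + nl," IV");
foundBefore = contains(raw," III" + nl);    % marker without space fails on raw text
foundAfter = contains(clean," III" + nl);   % and succeeds on cleaned text
```

Normalizing once up front keeps the marker definitions identical across the text, Word, and PDF versions of a document, at the cost of one extra pass over the text.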

PDF Form

To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.

    filename = "weatherReportForm1.pdf";
    data = readPDFFormData(filename)

    data =
      struct with fields:

             event_type: "Thunderstorm Wind"
        event_narrative: "Large tree down between Plantersville and Nettleton."
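Once the form data is in a struct, you can reach each field by name or loop over all fields. The sketch below uses a hand-built struct that mirrors the output shown above as a stand-in for the result of readPDFFormData, so it runs without the PDF file. The table layout and the variable names Field and Value are illustrative choices.

```matlab
% Stand-in for the struct returned by readPDFFormData (values copied
% from the output shown above).
data = struct('event_type',"Thunderstorm Wind", ...
    'event_narrative',"Large tree down between Plantersville and Nettleton.");

% Collect the field names and their text values.
names = string(fieldnames(data));
values = strings(size(names));
for k = 1:numel(names)
    values(k) = data.(char(names(k)));   % dynamic field access
end

% Arrange the form fields as a two-column table for later analysis.
formTable = table(names,values,'VariableNames',{'Field','Value'});
```

This loop assumes every form field holds scalar text; fields with other types would need their own handling.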

HTML

Extract text from HTML files, HTML code, and the web.

HTML File

To extract text data from a saved HTML file, use extractFileText.

    filename = "exampleSonnets.html";
    str = extractFileText(filename);

View the fourth sonnet by extracting the text between the two titles "IV" and "V".

    start = newline + "IV" + newline;
    fin = newline + "V" + newline;
    sonnet4 = extractBetween(str,start,fin)

    sonnet4 =
        "
         Unthrifty loveliness, why dost thou spend
         Upon thy self thy beauty's legacy?
         Nature's bequest gives nothing, but doth lend,
         And being frank she lends to those are free:
         Then, beauteous niggard, why dost thou abuse
         The bounteous largess given thee to give?
         Profitless usurer, why dost thou use
         So great a sum of sums, yet canst not live?
         For having traffic with thy self alone,
         Thou of thy self thy sweet self dost deceive:
         Then how when nature calls thee to be gone,
         What acceptable audit canst thou leave?
         Thy unused beauty must be tombed with thee,
         Which, used, lives th' executor to be.
         "

HTML Code

To extract text data from a string containing HTML code, use extractHTMLText.

    code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
    str = extractHTMLText(code)

    str =
        "THE SONNETS

         by William Shakespeare"

From the Web

To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.

    url = "https://www.mathworks.com/help/textanalytics";
    code = webread(url);
    str = extractHTMLText(code)

    str =
        'Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing,

         Text Analytics Toolbox includes tools for processing raw text from sources such as equipment

         Using machine learning techniques such as LSA, LDA, and word embeddings, you can find cluste

Parse HTML Code

To find particular elements of HTML code, parse the code using htmlTree and use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with element name "A".

    tree = htmlTree(code);
    selector = "A";
    subtrees = findElement(tree,selector);

View the first 10 subtrees and extract the text using extractHTMLText.

    subtrees(1:10)

    ans =
      10×1 htmlTree: