12.3. Using the Library: A Text-Query Program
FundamentalTo conclude our discussion of the library, we’ll implement a simple text-query program. Our program will let a user search a given file for words that might occur in it. The result of a query will be the number of times the word occurs and a list of lines on which that word appears. If a word occurs more than once on the same line, we’ll display that line only once. Lines will be displayed in ascending order—that is, line 7 should be displayed before line 9, and so on.
For example, we might read the file that contains the input for this chapter and look for the word element
. The first few lines of the output would be
element occurs 112 times
(line 36) A set element contains only a key;
(line 158) operator creates a new element
(line 160) Regardless of whether the element
(line 168) When we fetch an element from a map, we
(line 214) If the element is not found, find returns
followed by the remaining 100 or so lines in which the word element
occurs.
12.3.1. Design of the Query Program
FundamentalA good way to start the design of a program is to list the program’s operations. Knowing what operations we need can help us see what data structures we’ll need. Starting from requirements, the tasks our program must do include the following:
- When it reads the input, the program must remember the line(s) in which each word appears. Hence, the program will need to read the input a line at a time and break up the lines from the input file into its separate words
- When it generates output,
- The program must be able to fetch the line numbers associated with a given word
- The line numbers must appear in ascending order with no duplicates
- The program must be able to print the text appearing in the input file at a given line number.
These requirements can be met quite neatly by using various library facilities:
- We’ll use a
vector<string>
to store a copy of the entire input file. Each line in the input file will be an element in thisvector
. When we want to print a line, we can fetch the line using its line number as the index. - We’ll use an
istringstream
(§ 8.3, p. 321) to break each line into words. - We’ll use a
set
to hold the line numbers on which each word in the input appears. Using aset
guarantees that each line will appear only once and that the line numbers will be stored in ascending order. - We’ll use a
map
to associate each word with theset
of line numbers on which the word appears. Using amap
will let us fetch theset
for any given word.
For reasons we’ll explain shortly, our solution will also use shared_ptr
s.
Data Structures
Although we could write our program using vector, set
, and map
directly, it will be more useful if we define a more abstract solution. We’ll start by designing a class to hold the input file in a way that makes querying the file easy. This class, which we’ll name TextQuery
, will hold a vector
and a map
. The vector
will hold the text of the input file; the map
will associate each word in that file to the set
of line numbers on which that word appears. This class will have a constructor that reads a given input file and an operation to perform the queries.
The work of the query operation is pretty simple: It will look inside its map
to see whether the given word is present. The hard part in designing this function is deciding what the query function should return. Once we know that a word was found, we need to know how often it occurred, the line numbers on which it occurred, and the corresponding text for each of those line numbers.
The easiest way to return all those data is to define a second class, which we’ll name QueryResult
, to hold the results of a query. This class will have a print
function to print the results in a QueryResult
.
Sharing Data between Classes
Our QueryResult
class is intended to represent the results of a query. Those results include the set
of line numbers associated with the given word and the corresponding lines of text from the input file. These data are stored in objects of type TextQuery
.
Because the data that a QueryResult
needs are stored in a TextQuery
object, we have to decide how to access them. We could copy the set
of line numbers, but that might be an expensive operation. Moreover, we certainly wouldn’t want to copy the vector
, because that would entail copying the entire file in order to print (what will usually be) a small subset of the file.
We could avoid making copies by returning iterators (or pointers) into the TextQuery
object. However, this approach opens up a pitfall: What happens if the TextQuery
object is destroyed before a corresponding QueryResult
? In that case, the QueryResult
would refer to data in an object that no longer exists.
This last observation about synchronizing the lifetime of a QueryResult
with the TextQuery
object whose results it represents suggests a solution to our design problem. Given that these two classes conceptually “share” data, we’ll use shared_ptr
s (§ 12.1.1, p. 450) to reflect that sharing in our data structures.
Using the TextQuery
Class
When we design a class, it can be helpful to write programs using the class before actually implementing the members. That way, we can see whether the class has the operations we need. For example, the following program uses our proposed TextQuery
and QueryResult
classes. This function takes an ifstream
that points to the file we want to process, and interacts with a user, printing the results for the given words:
void runQueries(ifstream &infile)
{
// infile is an ifstream that is the file we want to query
TextQuery tq(infile); // store the file and build the query map
// iterate with the user: prompt for a word to find and print results
while (true) {
cout << "enter word to look for, or q to quit: ";
string s;
// stop if we hit end-of-file on the input or if a 'q' is entered
if (!(cin >> s) || s == "q") break;
// run the query and print the results
print(cout, tq.query(s)) << endl;
}
}
We start by initializing a TextQuery
object named tq
from a given ifstream
. The TextQuery
constructor reads that file into its vector
and builds the map
that associates the words in the input with the line numbers on which they appear.
The while
loop iterates (indefinitely) with the user asking for a word to query and printing the related results. The loop condition tests the literal true
(§ 2.1.3, p. 41), so it always succeeds. We exit the loop through the break
(§ 5.5.1, p. 190) after the first if
. That if
checks that the read succeeded. If so, it also checks whether the user entered a q
to quit. Once we have a word to look for, we ask tq
to find that word and then call print
to print the results of the search.
INFO
Exercises Section 12.3.1
Exercise 12.27: The TextQuery
and QueryResult
classes use only capabilities that we have already covered. Without looking ahead, write your own versions of these classes.
Exercise 12.28: Write a program to implement text queries without defining classes to manage the data. Your program should take a file and interact with a user to query for words in that file. Use vector
, map
, and set
containers to hold the data for the file and to generate the results for the queries.
Exercise 12.29: We could have written the loop to manage the interaction with the user as a do while
(§ 5.4.4, p. 189) loop. Rewrite the loop to use a do while
. Explain which version you prefer and why.
12.3.2. Defining the Query Program Classes
FundamentalWe’ll start by defining our TextQuery
class. The user will create objects of this class by supplying an istream
from which to read the input file. This class also provides the query
operation that will take a string
and return a QueryResult
representing the lines on which that string
appears.
The data members of the class have to take into account the intended sharing with QueryResult
objects. The QueryResult
class will share the vector
representing the input file and the set
s that hold the line numbers associated with each word in the input. Hence, our class has two data members: a shared_ptr
to a dynamically allocated vector
that holds the input file, and a map
from string
to shared_ptr<set>
. The map
associates each word in the file with a dynamically allocated set
that holds the line numbers on which that word appears.
To make our code a bit easier to read, we’ll also define a type member (§ 7.3.1, p. 271) to refer to line numbers, which are indices into a vector
of string
s:
class QueryResult; // declaration needed for return type in the query function
class TextQuery {
public:
using line_no = std::vector<std::string>::size_type;
TextQuery(std::ifstream&);
QueryResult query(const std::string&) const;
private:
std::shared_ptr<std::vector<std::string>> file; // input file
// map of each word to the set of the lines in which that word appears
std::map<std::string,
std::shared_ptr<std::set<line_no>>> wm;
};
The hardest part about this class is untangling the class names. As usual, for code that will go in a header file, we use std::
when we use a library name (§ 3.1, p. 83). In this case, the repeated use of std::
makes the code a bit hard to read at first. For example,
std::map<std::string, std::shared_ptr<std::set<line_no>>> wm;
is easier to understand when rewritten as
map<string, shared_ptr<set<line_no>>> wm;
The TextQuery
Constructor
The TextQuery
constructor takes an ifstream
, which it reads a line at a time:
// read the input file and build the map of lines to line numbers
TextQuery::TextQuery(ifstream &is): file(new vector<string>)
{
string text;
while (getline(is, text)) { // for each line in the file
file->push_back(text); // remember this line of text
int n = file->size() - 1; // the current line number
istringstream line(text); // separate the line into words
string word;
while (line >> word) { // for each word in that line
// if word isn't already in wm, subscripting adds a new entry
auto &lines = wm[word]; // lines is a shared_ptr
if (!lines) // that pointer is null the first time we see word
lines.reset(new set<line_no>); // allocate a new set
lines->insert(n); // insert this line number
}
}
}
The constructor initializer allocates a new vector
to hold the text from the input file. We use getline
to read the file a line at a time and push each line onto the vector
. Because file
is a shared_ptr
, we use the ->
operator to dereference file
to fetch the push_back
member of the vector
to which file
points.
Next we use an istringstream
(§ 8.3, p. 321) to process each word in the line we just read. The inner while
uses the istringstream
input operator to read each word from the current line into word
. Inside the while
, we use the map
subscript operator to fetch the shared_ptr<set>
associated with word
and bind lines
to that pointer. Note that lines
is a reference, so changes made to lines
will be made to the element in wm
.
If word
wasn’t in the map
, the subscript operator adds word
to wm
(§ 11.3.4, p. 435). The element associated with word
is value initialized, which means that lines
will be a null pointer if the subscript operator added word
to wm
. If lines
is null, we allocate a new set
and call reset
to update the shared_ptr
to which lines
refers to point to this newly allocated set
.
Regardless of whether we created a new set
, we call insert
to add the current line number. Because lines
is a reference, the call to insert
adds an element to the set
in wm
. If a given word occurs more than once in the same line, the call to insert
does nothing.
The QueryResult
Class
The QueryResult
class has three data members: a string
that is the word whose results it represents; a shared_ptr
to the vector
containing the input file; and a shared_ptr
to the set
of line numbers on which this word appears. Its only member function is a constructor that initializes these three members:
class QueryResult {
friend std::ostream& print(std::ostream&, const QueryResult&);
public:
QueryResult(std::string s,
std::shared_ptr<std::set<line_no>> p,
std::shared_ptr<std::vector<std::string>> f):
sought(s), lines(p), file(f) { }
private:
std::string sought; // word this query represents
std::shared_ptr<std::set<line_no>> lines; // lines it's on
std::shared_ptr<std::vector<std::string>> file; // input file
};
The constructor’s only job is to store its arguments in the corresponding data members, which it does in the constructor initializer list (§ 7.1.4, p. 265).
The query
Function
The query
function takes a string
, which it uses to locate the corresponding set
of line numbers in the map
. If the string
is found, the query
function constructs a QueryResult
from the given string
, the TextQuery file
member, and the set
that was fetched from wm
.
The only question is: What should we return if the given string
is not found? In this case, there is no set
to return. We’ll solve this problem by defining a local static
object that is a shared_ptr
to an empty set
of line numbers. When the word is not found, we’ll return a copy of this shared_ptr
:
QueryResult
TextQuery::query(const string &sought) const
{
// we'll return a pointer to this set if we don't find sought
static shared_ptr<set<line_no>> nodata(new set<line_no>);
// use find and not a subscript to avoid adding words to wm!
auto loc = wm.find(sought);
if (loc == wm.end())
return QueryResult(sought, nodata, file); // not found
else
return QueryResult(sought, loc->second, file);
}
Printing the Results
The print
function prints its given QueryResult
object on its given stream:
ostream &print(ostream & os, const QueryResult &qr)
{
// if the word was found, print the count and all occurrences
os << qr.sought << " occurs " << qr.lines->size() << " "
<< make_plural(qr.lines->size(), "time", "s") << endl;
// print each line in which the word appeared
for (auto num : *qr.lines) // for every element in the set
// don't confound the user with text lines starting at 0
os << "\t(line " << num + 1 << ") "
<< *(qr.file->begin() + num) << endl;
return os;
}
We use the size
of the set
to which the qr.lines
points to report how many matches were found. Because that set
is in a shared_ptr
, we have to remember to dereference lines
. We call make_plural
(§ 6.3.2, p. 224) to print time
or times
, depending on whether that size is equal to 1.
In the for
we iterate through the set
to which lines
points. The body of the for
prints the line number, adjusted to use human-friendly counting. The numbers in the set
are indices of elements in the vector
, which are numbered from zero. However, most users think of the first line as line number 1, so we systematically add 1 to the line numbers to convert to this more common notation.
We use the line number to fetch a line from the vector
to which file
points. Recall that when we add a number to an iterator, we get the element that many elements further into the vector
(§ 3.4.2, p. 111). Thus, file->begin() + num
is the num
th element after the start of the vector
to which file
points.
Note that this function correctly handles the case that the word is not found. In this case, the set
will be empty. The first output statement will note that the word occurred 0 times. Because *res.lines
is empty. the for
loop won’t be executed.
INFO
Exercises Section 12.3.2
Exercise 12.30: Define your own versions of the TextQuery
and QueryResult
classes and execute the runQueries
function from § 12.3.1 (p. 486).
Exercise 12.31: What difference(s) would it make if we used a vector
instead of a set
to hold the line numbers? Which approach is better? Why?
Exercise 12.32: Rewrite the TextQuery
and QueryResult
classes to use a StrBlob
instead of a vector<string>
to hold the input file.
Exercise 12.33: In Chapter 15 we’ll extend our query system and will need some additional members in the QueryResult
class. Add members named begin
and end
that return iterators into the set
of line numbers returned by a given query, and a member named get_file
that returns a shared_ptr
to the file in the QueryResult
object.