Chapter 1.4 Data: Its Representation, Structure and Management
1.4 (a) Number Systems and Character Sets
If my computer stores A as 01000001 and your computer stores A as 01000010 then the computers cannot communicate because they cannot understand each other’s codes. In the 1960’s a meeting in America agreed a standard set of codes so that computers could communicate with each other. This standard set of codes is known as the ASCII set. Most systems use ASCII so you can be fairly sure that when you type in A it is stored in the computer’s memory as 01000001.
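As a quick check of the idea, the short Python sketch below looks up the code for a character and shows it as eight bits. The function name to_bits is made up for this illustration; ord, chr and format are standard Python built-ins.

```python
def to_bits(ch: str) -> str:
    """Return the character's code as an 8-bit binary string."""
    return format(ord(ch), "08b")

print(to_bits("A"))            # 01000001
print(chr(int("01000001", 2))) # A
```

Any other system that follows the same standard would produce the same pattern for A, which is the whole point of agreeing on the codes.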
1.4 (b) Data Types
Numeric data.
There are different types of numbers that the computer must be able to recognise.
Numbers can be restricted to whole numbers. These are called INTEGERS and are stored by the computer as binary numbers using a whole number of bytes. It is usual to use either 2 bytes (called short integers) or 4 bytes (called long integers), the difference being simply that long integers can store larger numbers. Sometimes it is necessary to store negative integers or fractions or, perhaps, some other types of numbers.
Boolean data
Sometimes the answer to a question is either yes or no, true or false. There are only two options. The computer uses binary data which consists of bits of information that can be either 0 or 1, so it seems reasonable that the answer to such questions can be stored as a single bit with 1 standing for true and 0 standing for false. Data which can only have two states like this is known as BOOLEAN data.
A simple example of its use would be in the control program for an automatic washing machine. One of the important pieces of information for the processor would be to know whether the door was shut. A boolean variable could be set to 0 if it was open and to 1 if it was shut. A simple check of that value would tell the processor whether it was safe to fill the machine with water.
Date/Time and Currency
The computer has simply been told the rules that govern such data types and then checks the data that is input against the rules.
Characters
A character can be anything that is represented in the character set of the computer by a character code in a single byte.
1.4 (c) and (d) Expressing numbers in binary
These two sections can be combined. We are only interested in expressing numbers in binary form rather than in our decimal number system.
When a question asks for a conversion either to binary or back to decimal, always draw the box diagram that the numbers will be put into and put the headings on the boxes. The headings start from 1 on the right and are multiplied by two each time moving left, so that a question which wanted 8 bits for the answer would look like this:
128   64   32   16    8    4    2    1
Then consider the number that needs turning into binary. E.g. Turn 165 into binary.
Start on the left, in this case with 128. Does 128 go into 165? Yes. Put a 1 in the box.
128 has now been used up so take 128 from 165, there is 37 left.
Next box is 64. Does 64 go into 37? No. Put a 0 in the box.
Next box is 32. Does 32 go into 37? Yes. Put a 1 in the box.
32 has now been used up so take 32 from 37, there is 5 left.
Next box is 16. Does 16 go into 5? No. Put a 0 in the box.
Next box is 8. Does 8 go into 5? No. Put a 0 in the box.
Next box is 4. Does 4 go into 5? Yes. Put a 1 in the box.
4 has now been used up so take 4 from 5, there is 1 left.
Next box is 2. Does 2 go into 1? No. Put a 0 in the box.
Next box is 1. Does 1 go into 1? Yes. Put a 1 in the box.
1 has now been used up so take 1 from 1, there is 0 left.
No more boxes. End
(Notice that this is an algorithm which could be adapted into a general algorithm for working out binary numbers. Try it.)
The result is
128   64   32   16    8    4    2    1
  1    0    1    0    0    1    0    1
To turn a number into a denary number from binary, put the number into the boxes, with the headings on and then just add up the headings that have a one in the box.
E.g.
Don’t worry about other numbers; we will see those in chapter 3.4.
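The box method described in this section can be sketched in Python as follows. This is only one possible way of writing it; the function names to_binary and to_denary are made up for the illustration, and the sketch assumes whole numbers that are not negative.

```python
def to_binary(number: int, bits: int = 8) -> str:
    """Convert a non-negative whole number to binary using the box method."""
    if not 0 <= number < 2 ** bits:
        raise ValueError("number does not fit in the given number of bits")
    digits = []
    # Headings run 128, 64, ..., 2, 1 for 8 bits (largest on the left).
    for heading in (2 ** power for power in range(bits - 1, -1, -1)):
        if heading <= number:
            digits.append("1")
            number -= heading      # this heading has now been used up
        else:
            digits.append("0")
    return "".join(digits)

def to_denary(binary: str) -> int:
    """Add up the headings that have a 1 in the box."""
    total = 0
    heading = 1                    # the right-hand box has the heading 1
    for bit in reversed(binary):
        if bit == "1":
            total += heading
        heading *= 2
    return total

print(to_binary(165))         # 10100101
print(to_denary("10100101"))  # 165
```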
1.4 (e) Arrays
Data stored in a computer is stored at any location in memory that the computer decides to use. This means that similar pieces of data can be scattered all over memory. This, in itself, doesn’t matter to the user, except that to find each piece of data it has to be referred to by a variable name.
e.g. If it is necessary to store the 20 names of students in a group then each location would have to be given a different variable name. The first, Iram, might be stored in location NAME, the second, Sahin, might be stored in FORENAME, the third, Rashid, could be stored in CHRINAME, but I’m now struggling, and certainly 20 different variable names that made sense will be very taxing to come up with. Apart from anything else, the variable names are all going to have to be remembered.
Far more sensible would be to force the computer to store them all together using the variable name NAME. However, this doesn’t let me identify individual names, so if I call the first one NAME(1) and the second NAME(2) and so on, it is obvious that they are all people’s names and that they are distinguishable by their position in the list. Lists like this are called ARRAYS.
Because the computer is being forced to store all the data in an array together, it is important to tell the computer about it before it does anything else so that it can reserve that amount of space in its memory, otherwise there may not be enough space left when you want to use it. This warning of the computer that an array is going to be used is called INITIALISING the array. Initialising should be done before anything else so that the computer knows what is coming.
Initialising consists of telling the computer
· what sort of data is going to be stored in the array so that the computer knows what part of memory it will have to be stored in
· how many items of data are going to be stored, so that it knows how much space to reserve
· the name of the array so that it can find it again.
Different programming languages have different commands for doing this but they all do the same sort of thing. A typical command would be
DIM NAME$(20)
DIM is a command telling the computer that an array is going to be used
NAME is the name of the array
$ tells the computer that the data is going to be characters
(20) tells it that there are going to be up to 20 pieces of data.
To read data into the array, simply tell the computer what the data is and tell it the position to place it in,
e.g. NAME$(11) = “Rashid” will place Rashid in position 11 in the array (incidentally, erasing any other data that happened to be in there first).
To read data from the array is equally simple: tell the computer which position in the array and assign the data to another variable,
e.g. RESULT$ = NAME$(2) will place Sahin into a variable called RESULT$.
Searching for a particular person in the array involves a simple loop and a question
e.g. search for Liu in the array NAME$
Answer:
Counter = 1
While Counter is less than 21, Do
    If NAME$(Counter) = “Liu” Then Print “Found” and End
    Else Add 1 to Counter
Endwhile
Print “Name not in array”
End
Notice that this is an algorithm written in pseudocode. Try to produce an equivalent algorithm using a Repeat…Until loop structure.
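For comparison, here is a rough Python sketch of the same ideas: reserving the space, placing data in a position, reading it back and searching. Python lists count from 0, so the helper functions put, get and search (names invented for this sketch) subtract 1 to keep the positions 1 to 20 used in the text.

```python
SIZE = 20                      # how many items the array will hold
names = [""] * SIZE            # reserve all 20 string slots up front

def put(position: int, value: str) -> None:
    """Store value at a 1-based position, overwriting what was there."""
    names[position - 1] = value

def get(position: int) -> str:
    """Read the value held at a 1-based position."""
    return names[position - 1]

def search(target: str) -> int:
    """Return the 1-based position of target, or 0 if it is not in the array."""
    counter = 1
    while counter < SIZE + 1:          # "While Counter is less than 21"
        if names[counter - 1] == target:
            return counter
        counter += 1
    return 0

put(1, "Iram")
put(2, "Sahin")
put(11, "Rashid")
print(get(2))            # Sahin
print(search("Rashid"))  # 11
print(search("Liu"))     # 0
```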
1.4 (f) Linked Lists
A linked list of data items tells the computer to store the data in any location and to link it to the previous data item by giving the previous data item the address of the new one. That sounds very complex, but the idea is simple if we look at it in diagram form.
Note: The jagged line signifies that there are a number of others which would fit in there, but they are not shown.
To initialise a list, all that needs to be done is to create a new start pointer for this list and add it to the index of start pointers for all the other lists.
To search through the list for a particular piece of data follow these rules
1. Find the correct list in the index of lists
2. Follow the pointer to the next item
3. If the item is the one being searched for, report that it is found and end.
4. If the pointer shows that the end of the list has been reached, report that the item is not there and end.
5. Go to step 2.
(Try to write this algorithm in pseudocode using a while…endwhile loop)
To remove a value from a list, simply change the pointer that points to it into one that points to the next value after it. E.g. to remove Sahin from the example
Note that Sahin’s data is still there; it is just that there is no way of getting to it, so it might just as well not be.
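The searching and removing rules above can be sketched in Python. The Node class, the function names and the order of the four names are assumptions made for this illustration. Each node holds an item of data and a pointer to the next node, and None marks the end of the list.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    data: str
    next: Optional["Node"] = None   # pointer to the next item, None at the end

def build(items):
    """Link the items together and return the start pointer."""
    start = None
    for item in reversed(items):
        start = Node(item, start)
    return start

def search(start, target) -> bool:
    """Follow the pointers until the item is found or the list ends."""
    node = start
    while node is not None:
        if node.data == target:
            return True
        node = node.next
    return False

def remove(start, target):
    """Unlink target by re-pointing the item before it; return the start pointer."""
    if start is None:
        return None
    if start.data == target:
        return start.next
    node = start
    while node.next is not None:
        if node.next.data == target:
            node.next = node.next.next   # skip over the removed item
            break
        node = node.next
    return start

start = build(["Iram", "Rashid", "Sahin", "Zaid"])
print(search(start, "Sahin"))   # True
start = remove(start, "Sahin")
print(search(start, "Sahin"))   # False
```

In Python an item that can no longer be reached is eventually cleared away automatically, which is a detail of this language rather than of linked lists in general.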
1.4 (g) Stacks and Queues
Queues.
Information arrives at a computer in a particular order. It may not be numeric or alphabetic, but there is an order dependent on the time that it arrives. Imagine Zaid, Iram, Sahin and Rashid send jobs for printing, in that order. When these jobs arrive they are put in a queue awaiting their turn to be dealt with. It is only fair that when a job is called for by the printer, Zaid’s job is sent first because his has been waiting longest. These jobs are held, just like the other data we have been talking about, in an array. The jobs are put in at one end and taken out of the other. All the computer needs is a pointer showing it which one is next to be done (start pointer (SP)) and another pointer showing where the next job to come along will be put (end pointer (EP)).
1. Zaid is in the queue for printing, the end pointer is pointing at where the next job will go.
2. Iram’s job is input and goes as the next in the queue, the end pointer moves to the next available space.
3. Zaid’s job goes for printing so the start pointer moves to the next job, also Sahin’s job has been input so the end pointer has to move.
Notes: The array is limited in size, and the effect of this seems to be that the contents of the array are gradually moving up. Sooner or later the queue will reach the end of the array. The queue does not have to be held in an array, it could be stored in a linked list. This would solve the problem of running out of space for the queue.
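A minimal Python sketch of a queue held in an array with a start pointer and an end pointer is given below. The class name ArrayQueue and the array size of 4 are assumptions for the illustration. It deliberately does nothing clever about space, so it shows the problem described in the notes: once the end pointer reaches the end of the array, no more jobs can be added.

```python
class ArrayQueue:
    def __init__(self, size: int):
        self.slots = [None] * size
        self.sp = 0   # start pointer: the next job to be dealt with
        self.ep = 0   # end pointer: where the next arriving job will go

    def add(self, job):
        if self.ep == len(self.slots):
            raise OverflowError("queue has reached the end of the array")
        self.slots[self.ep] = job
        self.ep += 1

    def remove(self):
        if self.sp == self.ep:
            raise IndexError("queue is empty")
        job = self.slots[self.sp]
        self.sp += 1
        return job

queue = ArrayQueue(4)
queue.add("Zaid")
queue.add("Iram")
print(queue.remove())      # Zaid
queue.add("Sahin")
print(queue.sp, queue.ep)  # 1 3
```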
Stacks.
Imagine a queue where the data was taken off the array at the same end that it was put on. This would be a grossly unfair queue because the first one there would be the last one dealt with. This type of unfair queue is called a stack.
A stack will only need one pointer because adding things to it and taking things off it are only done at one end.
1. Zaid and Iram are in the stack. Notice that the pointer is pointing to the next space.
2. A job has been taken off the stack. It is found by the computer at the space under the pointer (Iram’s job), and the pointer moves down one.
3. Sahin’s job has been placed on the stack in the position signified by the pointer, the pointer then moves up one. This seems to be wrong, but there are reasons for this being appropriate in some circumstances which we will see later in the course.
In a queue, the Last one to come In is the Last one to come Out. This gives the acronym LILO, or FIFO (First in is the first out).
In a stack, the Last one In is the First one Out. This gives the acronym LIFO, or FILO (First in is the last out).
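The three stack steps described above can be sketched in Python with a single pointer that always points at the next free space. The class name ArrayStack and the array size of 4 are assumptions for this illustration.

```python
class ArrayStack:
    def __init__(self, size: int):
        self.slots = [None] * size
        self.pointer = 0   # points to the next free space

    def push(self, job):
        if self.pointer == len(self.slots):
            raise OverflowError("stack is full")
        self.slots[self.pointer] = job
        self.pointer += 1              # pointer moves up one

    def pop(self):
        if self.pointer == 0:
            raise IndexError("stack is empty")
        self.pointer -= 1              # pointer moves down one
        return self.slots[self.pointer]

stack = ArrayStack(4)
stack.push("Zaid")
stack.push("Iram")
print(stack.pop())      # Iram  (last in, first out)
stack.push("Sahin")
print(stack.pointer)    # 2
```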
1.4 (h) Files, Records, Items, Fields.
Data stored in computers is normally connected in some way. For example, the data about the 20 students in the set that has been the example over the last three sections has a connection because it all refers to the same set of people. Each person will have their own information stored, but it seems sensible that each person will have the same information stored about them, for instance their name, address, telephone number, exam grades…
All the information stored has an identity because it is all about the set of students, this large quantity of data is called a FILE.
Each student has their own information stored. This information refers to a particular student, it is called their RECORD of information. A number of records make up a file.
Each record of information contains the same type of information, name, address and so on. Each type of information is called a FIELD. A number of fields make up a record and all records from the same file must contain the same fields.
The data that goes into each field, for example “Iram Dahar”, “3671 Jaipur, 2415” will be different in most of the records. The data that goes in a field is called an ITEM of data.
(fixed length and variable length fields)
1.4 (i) Record Formats
To design a record format, the first thing to do is to decide what information would be sensible to be stored in that situation.
e.g. A teacher is taking 50 students on a rock-climbing trip. The students are being charged 20 dollars each and, because of the nature of the exercise, their parents may need to be contacted if there is an accident. The teacher decides to store the information as a file on a computer. Design the record format for the file.
The easiest way is to write them in a table:
Field               Data type    Size
Student number      Integer      1 byte
Student name        Character    20 bytes
Amount paid         Integer      1 byte
Emergency number    Character    12 bytes
Form (e.g. 3RJ)     Character    3 bytes
1.4 (j) Sizing a File
We have just designed the record format for a given situation. It may be necessary to calculate how large the file is going to be.
Having decided on the size of each field, it is a simple matter of adding up the individual field sizes to get the size of a record, in this case 37 bytes.
There are 50 students going on the trip, each of them having their own record, so the size of the data in the file will be 50 * 37 = 1,850 bytes.
All files need a few extra pieces of information that the user may not see, such as information at the start of the file saying when it was last updated, which file it is, and whether it is protected in any way. These extra pieces of information are known as overheads, and it is usual to add 10% to the size of a file because of the need for overheads. Therefore the size of the file is 1,850 bytes + (10% of 1,850 bytes) = 2,035 bytes.
The final stage is to ensure that the units are sensible for the size of the file.
There are 1024 bytes in 1 Kbyte, so the size of this file is 2,035/1024 = 1.99 Kbytes.
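The same sizing calculation can be checked with a few lines of Python. The variable names are invented for this sketch; the field sizes, the 50 students and the 10% overhead are the figures used in the text.

```python
# Field sizes in bytes, taken from the record format designed earlier.
field_sizes = {
    "Student number": 1,
    "Student name": 20,
    "Amount paid": 1,
    "Emergency number": 12,
    "Form": 3,
}

record_size = sum(field_sizes.values())   # size of one record
data_size = 50 * record_size              # 50 students, one record each
file_size = data_size + data_size // 10   # add 10% for overheads
kbytes = file_size / 1024                 # 1024 bytes in 1 Kbyte

print(record_size)       # 37
print(data_size)         # 1850
print(file_size)         # 2035
print(round(kbytes, 2))  # 1.99
```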
1.4 (k) Access Methods to Data
Serial access.
Data is stored in the computer in the order in which it arrives. This is the simplest form of storage, but the data is effectively unstructured, so finding it again can be very difficult. This sort of data storage is only used when it is unlikely that the data will be needed again, or when the order of the data should be determined by when it is input. A good example of a serial file is what you are reading now. The characters were all typed in, in order, and that is how they should be read. Reading this book would be impossible if all the words were in alphabetic order.
Sequential access.
In previous sections of this chapter we used the example of a set of students whose data was stored in a computer. The data could have been stored in alphabetic order of their name. It could have been stored in the order that they came in a Computing exam, or by age with the oldest first. However it is done the data has been arranged so that it is easier to find a particular record. If the data is in alphabetic order of name and the computer is asked for Zaid’s record it won’t start looking at the beginning of the file, but at the end, and consequently it should find the data faster.
A file of data that is held in sequence like this is known as a sequential file.
Indexed sequential.
Imagine a large amount of data, like the names and numbers in a phone book. To look up a particular name will still take a long time even though it is being held in sequence. Perhaps it would be more sensible to have a table at the front of the file listing the first letters of people’s names and giving a page reference to where those letters start. So to look up Jawad, a J is found in the table which gives the page number 232, and the search is then started at page 232 (where all the Js will be stored). This method of access involves looking up the first piece of information in an index, which narrows the search to a smaller area. Having done this, the data is then searched alphabetically in sequence. This type of data storage is called Indexed Sequential.
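A small Python sketch of the idea is given below. It assumes the names are already held in alphabetic order in a list, and the function names build_index and lookup are invented for the illustration. The index records where each first letter starts, and the search then works through that part of the list in sequence.

```python
def build_index(sorted_names):
    """Map each first letter to the position where that letter starts."""
    index = {}
    for position, name in enumerate(sorted_names):
        index.setdefault(name[0], position)   # keep only the first position
    return index

def lookup(sorted_names, index, target):
    """Return the position of target, or -1 if it is not in the file."""
    start = index.get(target[0])
    if start is None:
        return -1                             # no names begin with this letter
    position = start
    # Search in sequence, but only while the first letter still matches.
    while position < len(sorted_names) and sorted_names[position][0] == target[0]:
        if sorted_names[position] == target:
            return position
        position += 1
    return -1

names = ["Iram", "Jaheed", "Jawad", "Mahmood", "Rashid", "Sahin", "Zaid"]
index = build_index(names)
print(index["J"])                    # 1
print(lookup(names, index, "Jawad")) # 2
print(lookup(names, index, "Liu"))   # -1
```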
Random access.
A file that stores data in no order is very useful because it makes adding new data or taking data away very simple. In any form of sequential file an individual item of data is very dependent on other items of data. Jawad cannot be placed after Mahmood because that is the wrong ‘order’. However, it is necessary to have some form of order because otherwise the file cannot be read easily. What would be wonderful is if, by looking at the data that is to be retrieved, the computer can work out where that data is stored. In other words, the user asks for Jawad’s record and the computer can go straight to it because the word Jawad tells it where it is being stored.
1.4 (l) Implementation of File Access Methods
Serial access.
Serial files have no order, no aids to searching, and no complicated methods for adding new data. The data is simply placed on the end of the existing file and searches for data require a search of the whole file, starting with the first record and ending, either with finding the data being searched for, or getting to the end of the file without finding the data.
Sequential access.
Because sequential files are held in order, adding a new record is more complex, because it has to be placed in the correct position in the file. To do this, all the records that come after it have to be moved in order to make space for the new one.
Having to manipulate the file in this way is very time consuming and consequently this type of file structure is only used on files that have a small number of records or files that change very rarely. Larger files might use this principle, but would be split up by using indexing into what amounts to a number of smaller sequential files.
e.g. the account numbers for a bank’s customers are used as the key to access the customer accounts. The accounts are held sequentially and there are approximately 1 million accounts. There are 7 digits in an account number.
Indexes could be set up which identify the first two digits in an account number. Dependent on the result of this first index search, there is a new index for the next two digits, which then points to all the account numbers, held in order, that have those first four digits. There will be one index at the first level, but each entry in there will have its own index at the second level, so there will be 100 indexes at the second level. Each of these indexes will have 100 options to point to, so there will be 10,000 blocks of data records. But each block of records will only have a maximum of 1000 records in it, so adding a new record in the right place is now manageable, which it would not have been if the 1 million records were all stored together.
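The way a 7 digit account number is split across the two levels of index can be sketched in Python. The function name index_path and the account number 1234567 are made up for the illustration.

```python
def index_path(account_number: int):
    """Split a 7 digit account number into the keys used at each level."""
    digits = f"{account_number:07d}"       # keep any leading zeros
    first_level = digits[:2]               # looked up in the single first index
    second_level = digits[2:4]             # looked up in one of 100 second indexes
    within_block = digits[4:]              # position among up to 1000 records
    return first_level, second_level, within_block

print(index_path(1234567))   # ('12', '34', '567')

second_level_indexes = 10 ** 2             # one per first-level entry
blocks = second_level_indexes * 10 ** 2    # each second index has 100 entries
records_per_block = 10 ** 3                # three digits left over
print(second_level_indexes, blocks, records_per_block)   # 100 10000 1000
```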
Random access.
To access a random file, the data itself is used to give the address of where it is stored. This is done by carrying out some arithmetic (known as pseudo arithmetic because it doesn’t make much sense) on the data that is being searched for.
E.g. imagine that you are searching for Jawad’s data.
The rules that we shall use are that the alphabetic positions of the first and last letters in the name should be multiplied together; this will give the address of the student’s data.
So Jawad = 10 * 04 = 40. Therefore Jawad’s data is being held at address 40 in memory.
This algorithm is particularly simplistic, and does not give good results, as we shall soon see, but it illustrates the principle. Any algorithm can be used as long as it remains the same for all the data.
This type of algorithm is known as a HASHING algorithm.
The problem with this example can be seen if we try to find Jaheed’s data.
Jaheed = 10 * 04 = 40. The data for Jaheed cannot be here because Jawad’s data is here. This is called a CLASH. When a clash occurs, the simple solution is to work down sequentially until there is a free space. So the computer would inspect address 41, and if that was being used, 42, and so on until a blank space. The algorithm suggested here will result in a lot of clashes which will slow access to the data. A simple change in the algorithm will eliminate all clashes. If the algorithm is to write down the alphabetic position of all the letters in the name as 2 digit numbers and then join them together there could be no clashes unless two people had the same name.
e.g. Jawad = 10, 01, 23, 01, 04 giving an address 1001230104
Jaheed = 10, 01, 08, 05, 05, 04 giving an address 100108050504
The problem of clashes has been solved, but at the expense of using up vast amounts of memory (in fact more memory than the computer will have at its disposal). This is known as REDUNDANCY. Having so much redundancy in the algorithm is obviously not acceptable. The trick in producing a sensible hashing algorithm is to come up with a compromise that minimizes redundancy without producing too many clashes.
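Both hashing algorithms from this section, together with the rule for dealing with a clash, can be sketched in Python. The function names are invented for this sketch, and a Python dictionary stands in for the computer’s memory, with the address as the key. That is an assumption made to keep the example short, not how a real random access file is laid out.

```python
def letter_position(ch: str) -> int:
    """Alphabetic position of a letter: A is 1, B is 2, ... Z is 26."""
    return ord(ch.upper()) - ord("A") + 1

def simple_hash(name: str) -> int:
    """First letter position multiplied by last letter position."""
    return letter_position(name[0]) * letter_position(name[-1])

def joined_hash(name: str) -> int:
    """Every letter position as 2 digits, joined together."""
    return int("".join(f"{letter_position(ch):02d}" for ch in name))

def store(memory: dict, name: str) -> int:
    """Place name at its hashed address, working down past any clashes."""
    address = simple_hash(name)
    while address in memory:       # clash: try the next address
        address += 1
    memory[address] = name
    return address

def find(memory: dict, name: str):
    """Return the address holding name, or None if a blank space is reached."""
    address = simple_hash(name)
    while address in memory:
        if memory[address] == name:
            return address
        address += 1
    return None

memory = {}
print(store(memory, "Jawad"))    # 40
print(store(memory, "Jaheed"))   # 41  (clash at 40, so the next space is used)
print(find(memory, "Jaheed"))    # 41
print(joined_hash("Jawad"))      # 1001230104
print(joined_hash("Jaheed"))     # 100108050504
```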
1.4 (m) Selection of Data Types and Structures
1.4 (n) Backing up and Archiving Data
Backing up data.
Data stored in files is very valuable. It has taken a long time to input to the system, and often, is irreplaceable. If a bank loses the file of customer accounts because the hard disk crashes, then the bank is out of business.
It makes sense to take precautions against a major disaster. The simplest solution is to make a copy of the data in the file, so that if the disk is destroyed, the data can be recovered. This copy is known as a BACK-UP. In most applications the data is so valuable that it makes sense to produce more than one back-up copy of a file. Some of these copies will be stored away from the computer system in case of something like a fire, which would destroy everything in the building.
The first problem with backing up files is how often to do it. There are no right answers, but there are wrong ones. It all depends on the application. An application that involves the file being altered on a regular basis will need to be backed up more often than one that is very rarely changed (what is the point of making another copy if it hasn’t changed since the previous copy was made?). A school pupil file may be backed up once a week, whereas a bank customer file may be backed up hourly.
The second problem is that the back-up copy will rarely be the same as the original file because the original file keeps changing. If a back up is made at 9.00am and an alteration is made to the file at 9.05am, if the file now crashes, the back up will not include the change that has been made. It is very nearly the same, but not quite. Because of this, a separate file of all the changes that have been made since the last back up is kept. This file is called the transaction log and it can be used to update the copy if the original is destroyed. This transaction log is very rarely used. Once a new back up is made the old transaction log can be destroyed. Speed of access to the data on the transaction log is not important because it is rarely used, so a transaction log tends to use serial storage of the data and is the best example of a serial file if an examination question asks for one.
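How a back-up and a transaction log work together can be sketched in Python. The account numbers, balances and the function name restore are all invented for this illustration, and a dictionary stands in for the file.

```python
import copy

def restore(backup: dict, transaction_log: list) -> dict:
    """Rebuild the file from the back-up, then replay every logged change."""
    recovered = copy.deepcopy(backup)
    # The log is a serial file: changes are read in the order they arrived.
    for account, new_balance in transaction_log:
        recovered[account] = new_balance
    return recovered

# Hypothetical data: a back-up made at 9.00am and one change made at 9.05am.
backup_0900 = {"1234567": 100, "7654321": 250}
transaction_log = [("1234567", 80)]

print(restore(backup_0900, transaction_log))
# {'1234567': 80, '7654321': 250}
```

The back-up itself is left unchanged, so it can be used again if the restore has to be repeated.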
Archiving data.
Data sometimes is no longer being used. A good example would be in a school when pupils leave. All their data is still on the computer file of pupils, taking up valuable space. It is not sensible to just delete it, because there are all sorts of reasons why the data may still be important; for instance, a past pupil may ask for a reference. If all the data has been erased it may make it impossible to write a sensible reference. Data that is no longer needed on the file but may be needed in the future should be copied onto a long-term storage medium and stored away in case it is needed. This is known as producing an ARCHIVE of the data. (Schools normally archive data for 7 years before destroying it).
Note: Archived data is NOT used for retrieving the file if something goes wrong, it is used for storing little used or redundant data in case it is ever needed again, so that space on the hard drive can be freed up.
Example Questions
1. a) Express the number 113 (denary) in binary using an appropriate number of bits. (2)
b) Change the binary number 10110010 into a decimal number. (2)
2. Describe how characters are stored in a computer. (3)
3. a) Explain what is meant by an integer data type. (2)
b) State what is meant by Boolean data. (1)
4. An array is to be used to store information.
State three parameters that need to be given about the array before it can be used, explaining the reason why each is necessary. (6)
5. A garden centre stores details of each of the types of plant that it has for sale on a computer system. The details of the plants are stored in alphabetical order in a linked list.
a) By drawing a diagram show how the plants are arranged in the list. You may use the following plants to illustrate your answer:
Pansy, Dahlia, Clematis, Sweet pea. (4)
b) Describe how a new linked list, of plants that like shaded conditions, can be created from the original one. (4)
6. A stack is being held in an array. Items may be read from the stack or added to the stack.
a) State a problem that may arise when
(i) adding a new value to the stack
(ii) reading a value from the stack. (2)
b) Explain how the stack pointer can be used by the computer to recognise when such problems may occur. (2)
7. A library stores details of the books that are available.
a) Apart from title and author, state 3 other fields that it would be sensible for the library to store in this file, giving a reason why each of your chosen fields would be necessary. (6)
b) State which field would be used as the key field of the record and explain why a key field is necessary. (2)
c) State the size of each of the fields in your record. (2)
d) If the library stores approximately 20,000 books, estimate the size of the book file.
8. a) Explain the difference between a serial file and a sequential file. (2)
b) Describe what is meant by a hashing algorithm and explain why such an algorithm can lead to clashes. (3)
9. A library keeps both a book file and a member file. The library does a stock take twice a year and orders new books only once a year. Members can join or cancel their membership at any time.
a) Describe how the library can implement a sensible system of backing up their files. (4)
b) Explain the part that would be played by archiving in the management of the files (4)