February 21, 2006

Extended Instructions for Ones and Twos

1. How do I know when "there's only a data dictionary file available to
serve as the source document"?

A: The source file will have the extension .dct and there won't be any other resources for it.

2. Should I always be sure to check the title of a document on the Census
website?
A. This is not necessary as long as you have the file you are working from, which you will find in the “Incoming” box. For part two, though, it might be a good idea to check, at least for the sake of the “alternative title” slot.

3. "If your source is an ICPSR codebook, the agency is ICPSR and the ID
number is the study number, which is always printed on the first page of
ICPSR codebooks"
What if it's not an ICPSR codebook? Do I just erase the IDNo markup? I don't
think I've yet seen an ICPSR codebook... what do they look like?
A. If your source is from the Library, it is the call number. Other than these two circumstances, you don’t need to worry about filling in the IDNo slot.

4. What exactly does this element refer to?
A: This is who is responsible for the work. In part 1, it will be the MPC. In part 2 it will be the ICPSR or the Bureau of the Census.

5. Is "The year the source documentation was produced," the year that the
information was gathered, or the year that it was put together and/or
released as a cohesive work? (I've been using the latter)
A. This is the date that the version of the file we are using was published/put out for people to use. It may be different from the year that the information was gathered and deals with. For the Census Bureau files, the month and day of collection and the month and day of production will always be April 1st, xxxx (year).

6. I am very confused about version type. The instructions say "It’ll
either be a type of release or an edition and ICPSR is always edition." But
I don't know where to find out that information within the files... Oh, and
in the first file I did one and two for, you said to just delete that, so
that's what I've been doing.
A: ICPSR files will always be an “edition” and Census files will almost always be a “release.” You should be able to find this within the file that’s in the “Incoming.” If you can’t, it’s not a big deal.

7. I'm not *entirely* clear on the difference between "Citation for the
source documentation, not the source study" I think I've done alright on
this so far, but I think I found the separate files mostly by luck... How
much difference is there usually between these two?
A: Usually there isn’t much difference if any between the two.

8. What would "multiple source documents" consist of, and how would I know?
Would I have been working out of all of them? (Hasn't been a problem...)
A: Unfortunate LocMap makers will not be dealing with any multi-source files. It might happen in the future, but in that case you will need to ask Amy W. for her help with citing them.

9. Where would I usually find the abstracts for the various files? Do I
paste in the entire thing?
A: The abstract for the file can be found on the Census website, the ICPSR website, or in the Incoming file box; or, if it comes from library material, you may find it on the University library catalog site, in the description of the book/CD/etc. It is not important which abstract you use so long as it adequately explains what the data contained in the study is about. However, we would like this to be kept succinct, and the abstracts from the Census website tend to run a page or so long, so you might want to either cut them or use a different abstract.

12. “collDate” versus “timeperiod”? (Does the April first rule always apply?)
A: The collection date is the collection date, and timeperiod means the production time. Once again, the collection date will always be April 1st because the Census does all the actual data collection. The date and year depend on whose file we’re using, whether it was produced by ICPSR or by the Census. (If it was produced by the Census, once again it will be an April 1st file.)

13. geogCover, geogUnit, and universe are all things that I'm not seeing in
your instructions. So far, I've grasped the cover pretty well (USA), and figured
out the unit after a bit of searching, but, sorry to ask an obvious-sounding question, what is my universe? (Winter break has erased odd things from my mind...)
A: If you can’t find these in the “Incoming” file, you will find them on the Census or ICPSR website’s version of the file. These are important elements to make sure you get right.

14. dataCollector would just be the census or the ICPSR again, right?
A: The data collector will always be the Bureau of the Census.

15. What's this?
A: We don’t know. We leave it blank.

16. What's this? (It's under "use statement"...)
A: This is just the same old document Citation again.

January 6, 2006

Configuring TextPad to Make it Easier to Use

Some TextPad Modifications that might come in handy:

You can set your own shortcuts for commands as follows:
From the Configure menu, choose Preferences. The Preferences dialog box is displayed.
Select the Keyboard page.
Select the command category from the list, and the corresponding commands in that category will be displayed.
Select the command you wish to set a shortcut for.
Type the shortcut in the "Press new shortcut key" box. It may consist of one or two characters.
Click the Assign button.

Example: change the Find command's shortcut from F5 to "CTRL+F", like in Internet Explorer.

You can customize these view settings by choosing Preferences from the Configure menu:

Line numbers: Display line numbers in the left margin of each view. This option changes the default for all documents. Use the command on the View menu to change it for the active document only.

Tabbed document selector: Control whether it is displayed at the top or bottom of the window, and whether it is stacked or scrolling. Use the command on the View menu to display or hide the tabs.

Add the xml.syn and dditags.tcl to TextPad:

Get them from aggdata\METADATA\WORKING\Instructions and save them to the "[user profile]\Application Data\TextPad" folder

LocMaps for SSTF files

When you make locMaps for SSTF files, there are four parts to the process.
You will map:
A. the variables not in nCubes at the beginning of each line
B. the line numbers at the end of each line
C. the nCubes
D. the filler spaces after the nCubes


You will need:
1. aggdata\INCOMING\1990-SSF\whichever one you're working on\Docs\HOWTOUSE.ASC (if more than one, use all)*
2. aggdata\INCOMING\1990-SSF\whichever one you're working on\Docs\IDEN_FTN.ASC (if more than one, use all)**
3. aggdata\INCOMING\1990-SSF\whichever one you're working on\Docs\TBL_OUT.ASC (if more than one, use all)
4. putty.exe (download from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html)
5. aggdata\Programs\LocMap_normal_var.pl (copy from here to the directory where your file is)
6. aggdata\Programs\LocMap_nCube_3.pl (copy from here to the directory where your file is)
7. TextPad
8. an extra file for storing the locMap parts
9. the xml file for which the locMap is being created
10. an unzipped copy of the data file stored in aggdata\TRANSLATED\1990-SSF, available for viewing.


*HOWTOUSE.ASC has the record layout for the nCubes.
**IDEN_FTN.ASC has the record layout (with field widths, etc.) for the geographic and other non-nCube variables (denoted with an "M" for misc.).


I. Print (or have open for viewing) files 1 and 2 above.


II. Start PuTTY following the instructions below:
putty.gif
You'll get a black screen asking you to log in. Use your MPC username and password. Then navigate to the directory where the files and programs are with this command:
cd /pkg/aggdata/METADATA/dropbox/amy
If you are not working in dropbox, change as appropriate.


III. As needed, use these UNIX commands and tips:
a1. ls lists everything in the current directory.
a2. Hitting the up arrow cycles through the commands you've entered so far in your session. So, if you've run a locMap program once, you don't have to type it again; just hit the up arrow.
a3. Paste copied text by clicking the right mouse button. Inside PuTTY, highlighting text also copies it.
a4. You can use the Tab key to complete the rest of a directory name if that name is unique. So "cd /pkg/agg" + the Tab key fills out the rest of aggdata. It only takes one capital M to get METADATA and one lowercase d to get dropbox.


IV. Map the variables not in nCubes at the beginning of each line
Start LocMap_normal_var.pl (see below).
Bold text = prompts from the LocMap_normal_var.pl program; the rest are examples of what you would enter.

./LocMap_normal_var.pl
Enter file name: STF3-1-2-3-4.xml
Enter the variable ID of the first and last variable of your range (ex:U18-U20): ***
Input the number of repetitions for record identification strings that occur on EACH physical part of a longer logical record: 0 (or 1 or 16 or whatever) ****
Input the start position for first dataItem: 1 (always)
Input Variable widths:
Variable ID U18: 3 (these numbers come from IDEN_FTN.ASC)
Variable ID U19: 6
Variable ID U20: 2
Would you like to add any FILLS in this range (y/n): n (always n)

***Do not include the FILL or LINENO (LINENO= line number) variables in this group.

****The number of repetitions refers to record segments which are listed in the HOWTOUSE.ASC file. Open the file and do a find for "Segmentation of SSTF{fill in the number} Records". Count the TOTAL number of segments (whether a, b, c or whatever) and that's what you enter here. So a file with 1 A segment and 12 B segments has 13 total and you should enter 13. The image below shows the start of a segment listing (ignore the green text/arrows for now):

howtouse.gif

When you finish, an output file of the pattern "LocMap_normal_var.output" will be created in the same directory.


V. Save the contents of the LocMap_normal_var output to the holding file (item 8 above).


VI. Map the line numbers at the end of each line
-Open the data file.
-Hit the End key.
-Look at the bottom of the TextPad window for the character number (see image).
-That's the line length.
-Scroll down to the last line of the file.
-Note the line number (2043 or 77 or whatever).
-Open the holding file (item 8 above).
-Copy the last dataItem.
-Change the ID to "DI_LINENO"
-Change varRef to "LINENO"
-Change the endPos to the line length (8174 in the image above)
-Change the width to the number of characters in the last line number (4 for 2043, 2 for 77)
-Adjust the startPos accordingly.
-Save this dataItem just above the </locMap> tag.

Example:
<dataItem ID="DI_LINENO" source="producer" varRef="LINENO">
<physLoc source="producer" recRef="REC_1" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_2" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_3" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_4" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_5" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_6" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_7" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_8" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_9" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_10" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_11" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_12" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_13" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_14" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_15" startPos="8172" width="3" endPos="8174" />
</dataItem>
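As a sanity check on the LINENO numbers, the three attributes obey width = endPos - startPos + 1 (positions are 1-based and inclusive). A minimal sketch of the arithmetic, using hypothetical line counts:

```python
def lineno_positions(line_length, last_line_number):
    """Compute startPos/width/endPos for the LINENO dataItem.

    Positions are 1-based and inclusive, so width = endPos - startPos + 1.
    """
    width = len(str(last_line_number))  # digits in the last line number
    end_pos = line_length               # LINENO sits at the very end of the line
    start_pos = end_pos - width + 1
    return start_pos, width, end_pos

# Matches the example above: line length 8174 with a 3-digit line count
print(lineno_positions(8174, 999))   # (8172, 3, 8174)
```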


VII. Map the nCubes
Bold text = prompts from the LocMap_nCube_3.pl program; the rest are examples of what you would enter.
The program goes like this:

./LocMap_nCube_3.pl
Enter file name:
Enter the ID of the first and last nCube of your range (ex: NPB010-NPB015)
Enter recRef:
Enter Start position:
Enter width:

The program ends and an output file is created.

Now, SSTF files are broken into segments. These segments often start and/or end in the middle of an nCube. The program above can't capture this kind of break - it can only work with a whole table at a time. This means you have to construct the locMap segment by segment using the HOWTOUSE.ASC file.

Because the geography variables should all end at position 300, each record segment (which equals one or more lines in the data file) should start at 301. In other words, multiple data items reference the same space. This is OK because each data item references a different record group, so each item is still a unique combination of record group and position.
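To see why the overlap is safe, think of each physLoc as keyed by the (record group, position) pair rather than by position alone. A small sketch with hypothetical record refs:

```python
# Hypothetical physLoc entries: two record groups whose data both start at 301.
phys_locs = [
    ("REC_1", 301), ("REC_1", 310),
    ("REC_2", 301), ("REC_2", 310),
]

# Positions repeat, but each (recRef, startPos) combination is still unique.
assert len(set(phys_locs)) == len(phys_locs)
print("no collisions")
```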

If each segment broke at the end of each table, then you would run the LocMap_nCube_3.pl program once for each record segment. Instead, as you can see below, segments begin and/or end in the middle of tables.

howtouse.gif

Or, to put it this way:

recgroup.gif

LocMap_nCube_3.pl doesn't account for these kinds of breaks. Therefore, not only do you have to run LocMap_nCube_3.pl once for each record segment, but you also have to adjust your starting position for each segment so that whatever data item falls at position 301 is the one you are supposed to start with.

For example, based on the image above:

The input for record A (REC_1) would be:
range: NPA001-NHA003
record ID: REC_1
start position: 301
width: 9

The input for record B segment 1 (REC_2) would be:
range: NPB001-NPB007
record ID: REC_2
start position: 301
width: 9

NOTE: ALL dataItems starting with the 32nd data item in table NPB007 will be deleted as they are NOT in this segment.

The input for record B segment 2 (REC_3) would be:
range: NPB007-NPB018
record ID: REC_3
start position: 22
width: 9

NOTE: All dataItems BEFORE NPB007 cell 32 will be deleted (cell 32 should have a start position of 301). All dataItems starting with the 73rd dataItem in table NPB018 will be deleted as they are NOT in this segment.

The input for record B segment 3 (REC_4) would be:
range: NPB018-NPB022
record ID: REC_4
start position: 301 - (72*9) = -347
width: 9

NOTE: Same type of editing as before. Note that dataItem 73 in NPB018 starts in position 301. Because 72 cells take up more than 300 characters (648), the start location you enter has to be a negative number to get the first cell of this table located in this segment to start at 301. All dataItems starting with the 134th dataItem in table NPB022 will be deleted as they are NOT in this segment.

You repeat this process until you reach the end of the record segments.
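The start-position arithmetic above boils down to: back up from 301 by nine characters for every cell of the table that already appeared in earlier segments. A sketch (the cell counts come from the REC_2/REC_3/REC_4 examples):

```python
def segment_start_position(cells_in_earlier_segments, cell_width=9):
    """Start position to enter into LocMap_nCube_3.pl when a table
    began in an earlier segment.

    The first cell located in THIS segment must land at position 301,
    so we back up past the cells that belong to earlier segments.  When
    those cells cover more than 300 characters, the result is negative,
    which is expected.
    """
    return 301 - cells_in_earlier_segments * cell_width

print(segment_start_position(0))    # table starts fresh in this segment -> 301
print(segment_start_position(31))   # REC_3: NPB007 cell 32 lands at 301 -> 22
print(segment_start_position(72))   # REC_4: NPB018 cell 73 lands at 301 -> -347
```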


VIII. Map the filler.
On each line, after the nCube data but before the line numbers, there will be some blank spaces. We account for these using the FILL variable.
-After creating each locMap segment in the nCube-mapping step, go to the last dataItem.
-Copy this:

<dataItem ID="DI_FILL#" source="producer" varRef="FILL">
<physLoc source="producer" recRef="REC_#" startPos="#" width="#" endPos="#" />
</dataItem>

-Each FILL is numbered sequentially starting with 1.
-The REC_# should match the record it's filling out.
-The startPos = (endPos of the preceding dataItem + 1)
-The endPos = (startPos of the LINENO dataItem - 1)
-The width = endPos - startPos + 1
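Those three rules can be sketched as a little calculation (hypothetical positions; assumes positions are inclusive, as in the LINENO example):

```python
def fill_physloc(prev_end_pos, lineno_start_pos):
    """startPos/width/endPos for a FILL between the last nCube dataItem
    and the LINENO dataItem.  Positions are 1-based and inclusive."""
    start_pos = prev_end_pos + 1        # just after the preceding dataItem
    end_pos = lineno_start_pos - 1      # just before the line number
    width = end_pos - start_pos + 1
    return start_pos, width, end_pos

# Hypothetical numbers: last nCube ends at 8166, LINENO starts at 8172
print(fill_physloc(8166, 8172))   # (8167, 5, 8171)
```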


IX. Map "FILLER Tables" as filler variables.
In some cases, a record segment ends in the middle of a "FILLER Table". That means that the next segment begins with a FILLER Table and you need to calculate the correct start and end positions for the FILL variable that replaces the table and the first nCube in that segment.

Say you have two segments like this:

Segment 9

Geographic Identification PB40--17 data cells--through
Information PB44--259 data cells

8,165 characters including
5 characters filler

Segment 10

Geographic Identification PB44--227 data cells--through
Information PB45--388 data cells

8,165 characters including
2 characters filler

Check in TBL_OUT.ASC to see if either of the tables at the beginning or end is a FILLER Table. It will look like this:

PB47. FILLER

In those cases, before re-running LocMap_nCube_3.pl, insert another blank FILL variable.

-The REC_# should match the record it's filling out.
-The startPos = 301
-The endPos = 300 + (the number of data cells for that table x 9); in the example above, 300 + (227x9) = 2343
-The width = endPos - startPos + 1

When you run LocMap_nCube_3.pl now, your nCube range is just PB45 and the startPos is the FILL's endPos + 1.
Then clip off the extra dataItems as described in the NOTEs in the nCube-mapping step.
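The FILLER-table arithmetic can be sketched the same way (assuming inclusive positions, so endPos = startPos + width - 1; the 227-cell count comes from the Segment 10 example):

```python
def filler_table_fill(cells_in_this_segment, cell_width=9):
    """Positions for the FILL that stands in for a FILLER table at the
    start of a segment, plus the startPos for the nCube range after it.
    Assumes the segment's data begins at 301 and positions are inclusive.
    """
    start_pos = 301
    width = cells_in_this_segment * cell_width
    end_pos = start_pos + width - 1
    next_start = end_pos + 1            # where the next nCube range begins
    return start_pos, width, end_pos, next_start

print(filler_table_fill(227))   # (301, 2043, 2343, 2344)
```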

X. When you have finished, copy the contents of your holding file (i.e. the whole locMap) into your original xml file immediately above <dataDscr>. Save, and email me that you've finished.

December 16, 2005

Resources for creating sections 1 & 2

When you're putting in the info for sections 1 & 2, these resources will be handy:

1. The DDI schema - helps in case you get lost in repeating elements: http://webapp.icpsr.umich.edu/cocoon/DDI-STRUCTURE/Version2-1.xsd

2. The ICPSR Study description (if your file has one): http://www.icpsr.umich.edu/
2.1 Enter the ICPSR study number (on the first page of the documentation) and select "Study No" in the search field.
2.2 When you get the results, click on "Description".

3. The full documentation file, which will be in aggdata/INCOMING/xxxxx. If you don't see anything that looks like your file, try looking in ICPSR-Studies in INCOMING.

4. To find the other people who've done data entry on the files, go to aggdata/METADATA/WORKING/amy/Excel/Amy_tracking.xls; list everyone associated...

Sandy and Amy K (although she's gone now till Spring Semester starts) have walked through this and can help...

If you need Amy M-Th...

If you need to ask me something on a day other than Friday, feel free to email or come to the library. You can see if I'll be available by going to my public schedule.

December 9, 2005

Valid/Invalid Ranges for Geo Vars

In cases when a geo var has a range (e.g. 01-52, 98, 99) of possible catValues, we should use a "valid/invalid range." For instance, in the 2000 SF1 the variable "Congressional District 110" is listed in the data dictionary. For this variable, the values 01-52 correspond to "The Actual congressional district number", value 00 "Applies to states whose representative is elected ‘‘at large’’; i.e., the state has only one representative in the United States House of Representatives", and so on.


<var ID="G52" name="CD110">
<location locMap="LM"/>
<labl level="var">Congressional District (110th)</labl>
<valrng>
<range min="00" max="52"/>
<range min="98" max="99"/>
<key>01-52: The Actual congressional district number; 00: Applies to states whose representative is elected ‘‘at large’’; i.e., the state has only one representative in the United States House of Representatives; 98: Applies to areas that have an ‘‘at large’’ nonvoting delegate or resident commissioner in the United States House of Representatives; 99: Applies to areas that have no representation in the United States House of Representatives</key>
</valrng>
</var>

We should use a valid/invalid range when the codes we are referencing aren't commonly known. A couple of examples of commonly known schemes are FIPS or MCD. It's not always easy to know when coding schemes are "commonly known," so if there is any question, ask either Wendy or Amy.

December 2, 2005

Blog's Nearly Back

So, our blog disappeared a while back. I did recover the old entries and they should be getting imported any day now. In the meantime, this one's ready to start using for new entries. You'll note that Jeff has already added an entry.

The address is http://blog.lib.umn.edu/mpc/unfor

Everyone receiving this message can add entries to the blog except Wendy because she has previously indicated that she doesn't want to :) You would go to the address above, click on Login and enter your regular UofM username and password.

If you add something you want to be sure everyone sees or that is time-sensitive, after you save the entry, the screen will reload and in the light blue bar at the top is a link for notification. Click that and choose whether you want to send part of the entry, all of it or none of it and submit. An email will be sent to all unfortunates, Jennifer, Wendy and me.

If you are new to blogs, let me know and I'll give you a quick introduction!

NHGIS - Unfortunates

February 21, 2006

Extended Instructions for Ones and Twos

1. How do I know when "there's only a data dictionary file available to
serve as the source document"?

A: The source file will have the extension .dct and there wont be any other resources for it.

2. Should I always be sure to check the title of a document on the Census
website?
A. This is not necessary as long as you have the file you are working from, which you will find in the “Incoming? box, but, for part two it might be a good idea if you want to check- at least for the sake of the “alternative title? slot.

3. "If your source is an ICPSR codebook, the agency is ICPSR and the ID
number is the study number, which is always printed on the first page of
ICPSR codebooks"
What if its not an ICPSR codebook? do I just erase the IDNo markup? I don't
think I've yet seen an ICPSR codebook... what do they look like?
A. If your source is from the Library, it is the call number. Other than these tow circumstances, you don’t need to worry about filling in the IDNo slot.

4. What exactly does reffer to?
A: This is who is responsible for the work. In part 1, it will be the MPC. In part 2 it will be the ICPSR or the Bureau of the Census.

5. Is "The year the source documentation was produced," the year that the
information was gathered, or the year that it was put together and/or
released as a cohesive work? (I've been using the latter)
A. This is the date that the version of the file which we are using was published/put out for people to use. It may be different than the year that the information was gather and deals with. For the Census Bureau files, the month and day of collection and the month and day of production will always be April 1st, xxxx (year).

6. I am very confused about version type. The instructions say "It’ll
either be a type of release or an edition and ICPSR is always edition." But
I dont know where to find out that information within the files... Oh, and
in the first file I did one and two for, you said to just delete that, so
that's what I've been doing.
A: ICPSR files will always be an “edition? and Census files will almost always be a “release.? You should be able to find this within the file that’s in the “Incoming.? If you can’t, it’s not a big deal.

7. I'm not *entirely* clear on the difference between "Citation for the
source documentation, not the source study" I think I've done alright on
this so far, but I think I found the seperate files mostly by luck... How
much difference is there usually between these two?
A: Usually there isn’t much difference if any between the two.

8. What would "multiple source documents" consist of, and how would I know?
Would I have been working out of all of them? (Hasn't been a problem...)
A: Unfortunate LocMap makers will not be dealing with any multi-source files. It might happen in the future, but you will then need to ask Amy W. for her help with citing them.

9. Where would I usually find the abstracts for the various files? Do I
paste in the entire thing?
A: The abstract for the file can be found on either the Census website, the ICPSR website, or the Incoming file box, or, if it is from a library material, you may find it on the University library catalog site, in the description of the book/cd/etc. It is not important which abstract you use so long as it adequately explains what the data contained in the study is about. However, we would like this to be kapt succinct, and the abstract off of the Census website tend to run a page or so long, so you might want to either cut them or use a different abstract.

12. “collDate? versus “timeperiod?? (Does the April first rule always aply?)
A: The collection date is the collection date, and timeperiod means the production time. Once again, collection date will always be April 1st because the Census does all the actual data collection. The date and year depend on who’s file we’re using- whether it was produced by ICPSR or produced by the Census. (If it was produced byt the Census, once again it will be an April 1st file.)

13.geogCover, geogUnit, universe, are all things that I'm not seeing in
your instructions. So far, I've grasped the cover pretty well (USA), and figure
out the unit after a bit of searching, but, sorry to have to ask an obvious sounding question, but what is my universe? (Winter break has erased odd things from my mind...)
A: If you can’t find this on the “Incoming? file, you will find it on the Census or ICPSR website’s version of the file. This is an important element to make sure you get right.

14. dataCollector would just be the census or the ICPSR again, right?
A: The data collector will always always be the Bureau of the Census.

15. Whats this?
A: We don’t know. We leave it blank.

16. Whats this? (Its under "use statement..."
A: This is just the same old document Citation again.

January 6, 2006

Configuring TextPad to Make it Easier to Use

Some TextPad Modifications that might come in handy:

You can set your own shortcuts for commands as follows:
From the Configure menu, choose Preferences. The Preferences dialog box is displayed.
Select the Keyboard page.
Select the command category from the list, and the corresponding commands in that category will be displayed.
Select the command you wish to set a shortcut for.
Type the shortcut in the "Press new shortcut key" box. It may consist of one or two characters.
Click the Assign button.

Example: change the current Find command of F5 to "CTRL+F" like in Internet Explorer

You can customize these view settings by choosing Preferences from the Configure menu:

Line numbers Display line numbers in the left margin of each view. This option changes the default for all documents. Use the command on the View menu to change it for the active document only.

Tabbed document selector Control whether it is displayed at the top or bottom of the window, and whether it is stacked or scrolling. Use the command on the View menu to display or hide the tabs.

Add the xml.syn and dditags.tcl to TextPad:

Get them from aggdata\METADATA\WORKING\Instructions and save them to the "[user profile]\Application Data\TextPad" folder

LocMaps for SSF files

When you make locMaps for SSTF files, there are four parts to the process.
You will map:
A. the variables not in nCubes at the beginning of each line
B. the line numbers at the end of each line
C. the nCubes
D. the filler spaces after the nCubes


You will need:
1. aggdata\\INCOMING\1990-SSF\whichever one you're working on\Docs\HOWTOUSE.ASC (if more than one, use all)*
2. aggdata\\INCOMING\1990-SSF\whichever one you're working on\Docs\IDEN_FTN.ASC (if more than one, use all)**
3. aggdata\\INCOMING\1990-SSF\whichever one you're working on\Docs\TBL_OUT.ASC (if more than one, use all)
4. putty.exe (download from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html)
5. aggdata\Programs\LocMap_normal_var.pl (copy from here to the directory where your file is)
6. aggdata\Programs\LocMap_nCube_3.pl (copy from here to the directory where your file is)
7. TextPad
8. an extra file for storing the locMap parts
9. the xml file for which the locMap is being created
10. an unzipped copy of the data file in stored in aggdata\TRANSLATED\1990-SSF available for viewing.


*IDEN_FTN.ASC has the record layout (with file widths, etc) for the geographic and other non-nCube variables (denoted with an "M" for misc.).
**HOWTOUSE.ASC has the record layout for the nCubes.


I. Print or have open for viewing 1 and 2.


II. Start putty following the instructions below:
putty.gif
You'll get a black screen asking you to log in. Use your mpc username and password. Then navigate to the directory where the files and programs are with this command:
cd /pkg/aggdata/METADATA/dropbox/amy
If you are not working in dropbox, change as appropriate.


III. As needed, use these UNIX commands:
a1. ls = show me everything in my directory
a2. hitting the up arrow will enter in text you've entered so far in your session. So, if you've run a locMap program once, you don't have to type it again. Just hit the up arrow.
a3. Paste copied text by hitting the right mouse button. Inside putty, highlighting text also copies it.
a4. You can use the Tab key to fill out the rest of a directory name if that directory name is unique. Soooo..."cd /pkg/agg" + the Tab key fills out the rest of aggdata. It only takes one capital M to get metadata and one lower case d to get dropbox.


IV. Map the variables not in nCubes at the beginning of each line
Start LocMap_normal_var_v2.pl (see below).
Bold text = prompts from the LocMap_normal_var_v2.pl program; the rest are examples of what you would enter.

./LocMap_normal_var.pl
Enter file name: STF3-1-2-3-4.xml
Enter the variable ID of the first and last variable of your range (ex:U18-U20): ***
Mb>Input the number of repetitions for record identification strings that occur on EACH physical part of a longer logical record: 0 (or 1 or 16 or whatever) ****
Input the start position for first dataItem: 1 (always)
Input Variable widths:
Variable ID U18: 3 (these numbers come from IDEN_FTN.ASC)
Variable ID U19: 6
Variable ID U20: 2
Would you like to add any FILLS in this range(y/n): n (always n)

***Do not include the FILL or LINENO (LINENO= line number) variables in this group.

****The number of repetitions refers to record segments which are listed in the HOWTOUSE.ASC file. Open the file and do a find for "Segmentation of SSTF{fill in the number} Records". Count the TOTAL number of segments (whether a, b, c or whatever) and that's what you enter here. So a file with 1 A segment and 12 B segments has 13 total and you should enter 13. The image below shows the start of a segment listing (ignore the green text/arrows for now):

howtouse.gif

When you finish, an output file of the pattern "LocMap_normal_var.output" will be created in the same directory.


V. Save contents of LocMap_normal_var output to 7 and save.


VI. Map the line numbers at the end of each line
-Open the data file.
-Hit the End key.
-Look at the bottom of the Textpad window for the character number (see image).
-That's the line length.
-Scroll down to the last line of the file.
-Note the line number (2043 or 77 or whatever).
-Open 7.
-Copy the last dataitem.
-Change the ID to "DI_LINENO"
-Change varRef to "LINENO"
-Change the endPos to the line length (8174 in the image above)
-Change the width to the number of characters of the last line number (4 for 2043, 2 for 77)
-Adjust the startPos accordingly.
-Save this dataitem just above the </locMap> tag.

Example:
<dataItem ID="DI_LINENO" source="producer" varRef="LINENO">
<physLoc source="producer" recRef="REC_1" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_2" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_3" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_4" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_5" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_6" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_7" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_8" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_9" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_10" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_11" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_12" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_13" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_14" startPos="8172" width="3" endPos="8174" />
<physLoc source="producer" recRef="REC_15" startPos="8172" width="3" endPos="8174" />
</dataItem>
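The arithmetic behind those steps can be sketched in Python (a sketch only, assuming the inclusive convention width = endPos - startPos + 1 that the example above uses):

```python
def lineno_position(line_length, last_line_number):
    """Compute the physLoc geometry for the DI_LINENO data item.

    endPos is the line length; width is the number of digits in the
    last line number; startPos follows from the inclusive convention
    width = endPos - startPos + 1.
    """
    width = len(str(last_line_number))
    end_pos = line_length
    start_pos = end_pos - width + 1
    return start_pos, width, end_pos

# A line length of 8174 with 2043 lines (4 digits):
print(lineno_position(8174, 2043))  # (8171, 4, 8174)
```

The XML example above, with width 3 and startPos 8172, corresponds to a file whose last line number has 3 digits.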


VII. Map the nCubes
When you run LocMap_nCube_3.pl, it prompts for the following (type your answer after each prompt):

./LocMap_nCube_3.pl
Enter file name:
Enter the ID of the first and last nCube of your range (ex: NPB010-NPB015)
Enter recRef:
Enter Start position:
Enter width:

The program ends and an output file is created.
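For reference, the layout arithmetic behind the script can be sketched like this (a sketch only; the table names and cell counts below are made up for illustration):

```python
def cell_positions(tables, start_pos=301, width=9):
    """Return {ncube_id: (first_start, last_end)} for a run of nCubes
    laid end to end.  `tables` is a list of (ncube_id, cell_count)
    pairs; positions are inclusive, so width = endPos - startPos + 1
    for each cell."""
    out = {}
    pos = start_pos
    for ncube_id, cells in tables:
        first = pos
        pos += cells * width
        out[ncube_id] = (first, pos - 1)
    return out

# Hypothetical cell counts, for illustration only:
print(cell_positions([("NPB001", 5), ("NPB002", 10)]))
# NPB001 occupies positions 301-345, NPB002 occupies 346-435
```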

Now, SSTF files are broken into segments. These segments often start and/or end in the middle of an nCube. The program above can't capture this kind of break - it can only work with a whole table at a time. This means you have to construct the locMap segment by segment using the HOWTOUSE.ASC file.

Because the geography variables should all end at position 300, each record segment (which is equal to 1 or more lines in the data file) should start at 301. In other words, multiple data items reference the same space. This is OK because each data item references a different record group, so each item is still a unique combination of record group and position.

If each segment broke at the end of each table, then you would run the LocMap_nCube_3.pl program once for each record segment. Instead, as you can see below, segments begin and/or end in the middle of tables.

[Image: howtouse.gif]

Or, to put it this way:

[Image: recgroup.gif]

LocMap_nCube_3.pl doesn't account for these kinds of breaks. Therefore, not only do you have to run the program once for each record segment, but you also have to adjust your starting position for each segment so that whatever data item falls at position 301 is the one you are supposed to start with.

For example, based on the image above:

The input for record A (REC_1) would be:
range: NPA001-NHA003
record ID: REC_1
start position: 301
width: 9

The input for record B segment 1 (REC_2) would be:
range: NPB001-NPB007
record ID: REC_2
start position: 301
width: 9

NOTE: ALL dataItems starting with the 32nd data item in table NPB007 will be deleted as they are NOT in this segment.

The input for record B segment 2 (REC_3) would be:
range: NPB007-NPB018
record ID: REC_3
start position: 22
width: 9

NOTE: All dataItems BEFORE NPB007 cell 32 will be deleted (cell 32 should have a start position of 301). All dataItems starting with the 73rd dataItem in table NPB018 will be deleted as they are NOT in this segment.

The input for Record B segment 3 would be:
range: NPB018-NPB022
record ID: REC_4
start position: 301 - (72*9) = -347
width: 9

NOTE: Same type of editing as before. Note that dataItem 73 in NPB018 starts in position 301. Because 72 cells take up more than 300 characters (72 x 9 = 648), the start location you enter has to be a negative number to get the first cell of this table located in this segment to start at 301. All dataItems starting with the 134th dataItem in table NPB022 will be deleted as they are NOT in this segment.

You repeat this process until you reach the end of the record segments.
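All of these start positions come from one formula: start at 301 and subtract 9 characters for every cell of the table that was already output in earlier segments. A quick sketch:

```python
def segment_start(cells_done, width=9, data_start=301):
    """Start position to feed LocMap_nCube_3.pl so that the first cell
    belonging to the current segment lands at `data_start`.  Negative
    values are expected when the cells already output in earlier
    segments take up more than (data_start - 1) characters."""
    return data_start - cells_done * width

print(segment_start(72))  # -347, as in the Record B segment 3 example
print(segment_start(31))  # 22, as in the Record B segment 2 example
```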


VIII. Map the filler.
On each line, after the nCube data but before the line numbers, there will be some blank spaces. We account for these using the FILL variable.
-After creating each locMap segment in the nCube-mapping step above, go to the last dataitem.
-Copy this:

<dataItem ID="DI_FILL#" source="producer" varRef="FILL">
<physLoc source="producer" recRef="REC_#" startPos="#" width="#" endPos="#" />
</dataItem>

-Each fill is sequential starting with 1.
-The REC_# should match the one it's filling out.
-The startPos = (endPos of preceding dataitem + 1)
-The endPos = (startPos of the lineno dataitem - 1)
-The width = endPos - startPos + 1
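Those three rules can be sketched as follows (the 8166/8172 positions below are made up for illustration):

```python
def fill_geometry(prev_end_pos, lineno_start_pos):
    """Geometry for a FILL data item sitting between the last nCube
    data item and the line-number field, using the inclusive
    convention width = endPos - startPos + 1."""
    start_pos = prev_end_pos + 1
    end_pos = lineno_start_pos - 1
    width = end_pos - start_pos + 1
    return start_pos, width, end_pos

# If the last nCube cell ends at 8166 and LINENO starts at 8172:
print(fill_geometry(8166, 8172))  # (8167, 5, 8171)
```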


IX. Map "FILLER Tables" as filler variables.
In some cases, a record segment ends in the middle of a "FILLER Table". That means that the next segment begins with a FILLER Table and you need to calculate the correct start and end positions for the FILL variable that replaces the table and the first nCube in that segment.

Say you have two segments like this:

Segment 9

Geographic Identification PB40--17 data cells--through
Information PB44--259 data cells

8,165 characters including
5 characters filler

Segment 10

Geographic Identification PB44--227 data cells--through
Information PB45--388 data cells

8,165 characters including
2 characters filler

Check in TBL_OUT.ASC to see if either of the tables at the beginning or end is a FILLER Table. It will look like this:

PB47. FILLER

In those cases, before re-running LocMap_nCube_3.pl, insert another blank FILL variable.

-The REC_# should match the one it's filling out.
-The startPos = 301
-The endPos = 301 + (the number of data cells for that table x 9) - 1 (in the example above, 301 + (227 x 9) - 1 = 2343)
-The width = endPos - startPos + 1 (here, 227 x 9 = 2043)

When you run LocMap_nCube_3.pl now, your nCube range is just PB45 and the startPos is 2343 + 1 = 2344.
Then clip off the extra dataItems as in the nCube-mapping section above.
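Assuming the inclusive convention width = endPos - startPos + 1 (as in the LINENO example earlier), the numbers for a segment that opens with leftover FILLER cells work out like this:

```python
def leading_filler(cells, width=9, data_start=301):
    """Geometry for a FILL variable replacing `cells` cells of a FILLER
    Table at the top of a segment, plus the start position for the
    first real nCube that follows it.  Positions are inclusive."""
    fill_width = cells * width
    fill_end = data_start + fill_width - 1
    next_start = fill_end + 1
    return data_start, fill_width, fill_end, next_start

# 227 leftover PB44 cells, as in the Segment 10 example:
print(leading_filler(227))  # (301, 2043, 2343, 2344)
```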

X.
When you have finished, copy the contents of your holding file (i.e. the whole locMap) into your original XML file immediately above <dataDscr>. Save, and email me to say you've finished.

December 16, 2005

Resources for creating sections 1 & 2

When you're putting in the info for sections 1 & 2, these resources will be handy:

1. The DDI schema - helps in case you get lost in repeating elements: http://webapp.icpsr.umich.edu/cocoon/DDI-STRUCTURE/Version2-1.xsd

2. The ICPSR Study description (if your file has one): http://www.icpsr.umich.edu/
2.1 Enter the ICPSR study number (on the first page of the documentation) and select "Study No" in the search field.
2.2 When you get the results, click on "Description".

3. The full documentation file, which will be in aggdata/INCOMING/xxxxx. If you don't see anything that looks like your file, try looking in ICPSR-Studies in INCOMING.

4. To find the other people who've done data entry on the files, go to aggdata/METADATA/WORKING/amy/Excel/Amy_tracking.xls; list everyone associated...

Sandy and Amy K (although she's gone now till Spring Semester starts) have walked through this and can help...

If you need Amy M-Th...

If you need to ask me something on a day other than Friday, feel free to email or come to the library. You can see if I'll be available by going to my public schedule.

December 9, 2005

Valid/Invalid Ranges for Geo Vars

In cases where a geo var has a range (e.g. 01-52, 98, 99) of possible catValu values, we should use a "valid/invalid range." For instance, in the 2000 SF1 the variable "Congressional District (110th)" is listed in the data dictionary. For this variable, the values 01-52 correspond to "The Actual congressional district number," value 00 "Applies to states whose representative is elected ‘‘at large’’; i.e., the state has only one representative in the United States House of Representatives," and so on.


<var ID="G52" name="CD110">
<location locMap="LM"/>
<labl level="var">Congressional District (110th)</labl>
<valrng>
<range min="00" max="52"/>
<range min="98" max="99"/>
<key>01-52: The Actual congressional district number; 00: Applies to states whose representative is elected ‘‘at large’’; i.e., the state has only one representative in the United States House of Representatives; 98: Applies to areas that have an ‘‘at large’’ nonvoting delegate or resident commissioner in the United States House of Representatives; 99: Applies to areas that have no representation in the United States House of Representatives</key>
</valrng>
</var>

We should use a valid/invalid range when the codes we are referencing aren't commonly known. A couple of examples of commonly known schemes are FIPS or MCD. It's not always easy to know when coding schemes are "commonly known," so if there is any question, ask either Wendy or Amy.

December 2, 2005

Blog's Nearly Back

So, our blog disappeared a while back. I did recover the old entries and they should be getting imported any day now. In the meantime, this one's ready to start using for new entries. You'll note that Jeff has already added an entry.

The address is http://blog.lib.umn.edu/mpc/unfor

Everyone receiving this message can add entries to the blog except Wendy, because she has previously indicated that she doesn't want to :) You would go to the address above, click on Login, and enter your regular UofM username and password.

If you add something you want to be sure everyone sees, or that is time-sensitive: after you save the entry, the screen will reload, and in the light blue bar at the top there is a link for notification. Click that, choose whether you want to send part of the entry, all of it, or none of it, and submit. An email will be sent to all unfortunates, Jennifer, Wendy, and me.

If you are new to blogs, let me know and I'll give you a quick introduction!