Tuesday, August 9, 2022

Browsing a GEDCOM File

GEDCOM is the standard file format for sharing genealogical information. The version 5.5.1 specification is old, but still widely used. A GEDCOM file is generally understood to be output from one computer program and then input to another. But a GEDCOM file is plain text, so its contents can be browsed as text, and that may be the only way to see all the details that were output by the generating program.

On GitHub I have a little macOS app that I use to browse GEDCOM files. Some of my GEDCOM files are very old, sent by cousins when I first got into genealogy over 20 years ago. Some of the files were created only recently, when I downloaded information from WikiTree and from Ancestry. (It is the standard, so a GEDCOM file is what you get when you download your tree from somewhere.)

Why not just use a plain-text editor for browsing? Discussion follows this screenshot of the app.

In this example the opened file is my tree downloaded from Ancestry. I created that tree to help Ancestry-DNA matches determine our common ancestors. Because of that sole purpose, the file contains records for only 37 individuals (35 of them my known ancestors) in 18 families. (My private GEDCOM file on my personal computer has about 1500 individuals.)

A single record in a GEDCOM file generally spans multiple lines. The beginning of a new record is indicated by, among other things, a zero as the first character on a line. Though not obvious from the screenshot, the entire opened file can be scrolled from beginning to end in the view on the left. I've scrolled to the point where the beginning of the record for Ann Cavanaugh, one of my great-great-grandmothers, is displayed. After I clicked the "Process" button, the app separated the file into records, which are sorted and selectable in the four views on the right. I can easily jump from one record to another. In the list of individual records I've selected Ann Cavanaugh, which is much easier than searching for and scrolling to Ann in the original file. But what if Ann had been in a file with thousands of individuals? I would have temporarily switched the list of individuals to sort by surname.
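The record-splitting idea described above can be sketched in a few lines of Python. This is only an illustration of the "zero starts a new record" rule, not the code in my app; the function name `split_records` and the tiny sample file (including the short labels `@I1@` and `@F17@`) are made up for the example.

```python
def split_records(text):
    """Group the lines of a GEDCOM file into records.

    A line whose level number is 0 starts a new record; every
    following line with a higher level belongs to that record.
    """
    records = []
    for line in text.lstrip("\ufeff").splitlines():  # drop a leading BOM, if any
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "0" or not records:
            records.append([line])    # start a new record
        else:
            records[-1].append(line)  # continue the current record
    return records


# Hypothetical miniature file, for illustration only.
sample = """\
0 HEAD
1 CHAR UTF-8
0 @I1@ INDI
1 NAME Ann /Cavanaugh/
1 FAMS @F17@
0 @F17@ FAM
1 WIFE @I1@
0 TRLR
"""

records = split_records(sample)
for rec in records:
    print(rec[0], "-", len(rec), "line(s)")
```

Once the file is a list of records, sorting and selecting (by record type, label, or surname) is ordinary list handling.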

Most records in a GEDCOM file, including every individual and family record, start (right after the level number) with a unique cross-reference label. Ancestry has generated very long cross-references for the individual records. Ann's individual record indicates that she is a spouse in family F17. With F17 selected in the list of families, we can see that the family record links back to Ann's individual record: a two-way link between individual and family records, as required by GEDCOM. My little app is oblivious to most of GEDCOM's requirements and recommendations. Basically all it does is display whatever the content happens to be for each record.
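My app does not check the two-way links, but the check itself is simple enough to sketch in Python. The sketch below looks in one direction only: for each FAMS or FAMC pointer in an individual record, it verifies that the family record points back with HUSB/WIFE or CHIL. The function names, the regular expression, and the sample data (with a deliberately missing CHIL line) are assumptions made up for this illustration.

```python
import re

# level, optional @xref@ label, tag, optional value
LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")


def index_records(text):
    """Map each cross-reference label to (record tag, [(tag, value), ...]),
    keeping only the level-1 lines of each record."""
    index = {}
    current = None
    for raw in text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue
        level, xref, tag = int(m.group(1)), m.group(2), m.group(3)
        value = (m.group(4) or "").strip()
        if level == 0:
            current = None
            if xref:
                index[xref] = (tag, [])
                current = index[xref][1]
        elif level == 1 and current is not None:
            current.append((tag, value))
    return index


def one_way_links(index):
    """Return (individual, family, tag) for each FAMS/FAMC pointer
    whose family record does not point back to the individual."""
    problems = []
    for xref, (record_tag, fields) in index.items():
        if record_tag != "INDI":
            continue
        for tag, value in fields:
            if tag not in ("FAMS", "FAMC"):
                continue
            wanted = ("HUSB", "WIFE") if tag == "FAMS" else ("CHIL",)
            family = index.get(value)
            linked_back = family is not None and any(
                t in wanted and v == xref for t, v in family[1]
            )
            if not linked_back:
                problems.append((xref, value, tag))
    return problems


# Hypothetical data: @I2@ claims to be a child in @F17@,
# but the family record has no matching CHIL line.
linked_sample = """\
0 @I1@ INDI
1 NAME Ann /Cavanaugh/
1 FAMS @F17@
0 @I2@ INDI
1 NAME Example /Child/
1 FAMC @F17@
0 @F17@ FAM
1 WIFE @I1@
0 TRLR
"""

print(one_way_links(index_records(linked_sample)))
```

A fuller check would also walk the other direction, from each family's HUSB, WIFE, and CHIL lines back to the individuals.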

While preparing this post and reviewing the earliest version of this screenshot, I noticed that there were too many families and individuals: duplicates and even triplicates, with the extras differing in having mangled links. Here is how I think that happened. Before last December I had already enjoyed connecting with DNA-test-match cousins at other sites. Worried that I might be missing something at Ancestry, last December I took advantage of their DNA test sale. To prepare for the anticipated new connections, I created my first tree at Ancestry, adding only my ancestors, one at a time. After that I succumbed to the allure of confirming Ancestry's hints. I think that is how the extra, mangled relationships were generated. After several iterations of deletions and repairs, my tree is back to what I originally intended. So the point is that downloading and browsing a GEDCOM file can help find problems that are not obvious in the vendor's interface.

I prefer WikiTree. These days my genealogical research results get transferred directly to WikiTree, either by updating existing WikiTree profiles, collaboratively if possible, or sometimes by adding new profiles there. The GitHub link for my GEDCOM Browser app has a README file on its main page which contains a screenshot similar to the one on this page. In the GitHub screenshot the file is a GEDCOM that I downloaded from WikiTree. In that file I've also selected Ann Cavanaugh for illustration, so the two screenshots can be compared and contrasted. The contrasts between a WikiTree GEDCOM and an Ancestry GEDCOM illustrate that there is a great deal of flexibility in how information can be organized in a GEDCOM file. That flexibility can be a hindrance when importing a GEDCOM from somewhere else.

There is a problem with GEDCOM files generated by WikiTree that transcends whether WikiTree follows the spirit of the GEDCOM law. The problem is well explained in this Tamura Jones web page from over five years ago. WikiTree exports the entire biographical section of a profile as one long note in the individual's GEDCOM record. A large number of continuation lines are needed. WikiTree breaks a line and continues on the next line when the line length is well short of the limit specified by GEDCOM. All of that is within the letter of the GEDCOM law. The problem comes when a multi-byte UTF-8 character (for example, an accented letter in a Hungarian name, or a curly "educated" quotation mark) happens to straddle the arbitrary byte count where WikiTree breaks the line. The character's bytes end up split between two lines, and a malformed UTF-8 file is generated. Fortunately my Mac text editor of choice follows orders when I tell it to open the file as UTF-8, but it complains and warns me, highlighting the problem spots. So it was relatively easy to repair the file manually. I only recently experimented with downloading GEDCOM from WikiTree, and it was surprising to eventually discover that the malformed UTF-8 problem had been identified five years ago and never fixed.
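For anyone without a text editor that flags the bad spots, here is a minimal Python sketch that finds them. It reads the file as raw bytes and tries to decode each line as UTF-8 on its own, so a character split across a line break shows up as an error on both lines. The function name `find_bad_utf8_lines`, the sample name, and the use of a CONC continuation line in the sample are my assumptions for the illustration, not details taken from a WikiTree export.

```python
def find_bad_utf8_lines(data):
    """Return (line number, byte offset within the line, reason)
    for each line of `data` (bytes) that is not valid UTF-8."""
    problems = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            problems.append((number, err.start, err.reason))
    return problems


# Build a deliberately broken example: split a two-byte character
# (the accented e, bytes C3 A9) across a continuation line.
note = "1 NOTE Erzsébet".encode("utf-8")
cut = note.index(b"\xc3") + 1  # break between the two bytes of the character
broken = note[:cut] + b"\n2 CONC " + note[cut:] + b"\n"

for number, offset, reason in find_bad_utf8_lines(broken):
    print(f"line {number}, byte {offset}: {reason}")
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. The sketch only locates the problems; the repair is still done by hand.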
