Dataset

Enron Email Data with Manager-Subordinate Relationship Metadata

Added By cpdiehl

GraphML Representation of the Enron Email Dataset – Version 0.12

Overview

This dataset contains a GraphML representation of the Enron email dataset, derived from the MySQL representations previously released by USC/ISI [1] and UC Berkeley [2]. It also contains ground truth about a set of manager-subordinate relationships within the company that existed between January 2000 and November 2001 [3]. This ground truth was the basis for the experiments on social relationship identification conducted in [4].

Data Description

Email Message Data

The email message data is derived from the messages, addresses and recipients tables in the UC Berkeley MySQL database. Both the UC Berkeley and USC/ISI datasets were conditioned by their institutions in ways that are not known. The UC Berkeley data was selected over the USC/ISI data because inspection of both showed that it retains more dimensions of the original data.

The GraphML representation of the messages captures much of the information associated with a given message using Email Address and Message nodes. The attributes of each node type are described below.

Email Address Node
address – [String] The email address represented by the node.
fullyObserved – [Boolean] Whether the address belongs to one of the employees whose inboxes make up the dataset. According to the USC/ISI data, the Enron email dataset represents the union of the inboxes of 151 employees; the addresses of those employees are listed in the employeelist table in the USC/ISI MySQL database.
type – [String] “Email Address”

Message Node
datetime – [String] The datetime associated with the message in UTC.
epochSecs – [Integer] The datetime represented as seconds from the epoch.
subject – [String] The subject of the message.
body – [String] The body of the message.
emailID – [String] The email message ID from the message header.
type – [String] “Message”
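
As a rough illustration of how the two node types above might be read, the following sketch parses a GraphML document with Python's standard library and groups nodes by their type attribute. The sample document, its key ids (d0 to d3), node ids and values are invented for the example, and the function name nodes_by_type is hypothetical; the actual file may declare its keys differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard GraphML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Invented sample: key ids, node ids and values are illustrative only.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="type" attr.type="string"/>
  <key id="d1" for="node" attr.name="address" attr.type="string"/>
  <key id="d2" for="node" attr.name="subject" attr.type="string"/>
  <key id="d3" for="node" attr.name="epochSecs" attr.type="int"/>
  <graph id="G" edgedefault="directed">
    <node id="n0">
      <data key="d0">Email Address</data>
      <data key="d1">alice@example.com</data>
    </node>
    <node id="n1">
      <data key="d0">Message</data>
      <data key="d2">Quarterly review</data>
      <data key="d3">978307200</data>
    </node>
  </graph>
</graphml>
"""


def nodes_by_type(graphml_text):
    """Return {type: [attribute dict, ...]} for every node in the document."""
    root = ET.fromstring(graphml_text)
    # GraphML stores attribute names on <key> elements; <data> refers to key ids.
    key_names = {
        key.get("id"): key.get("attr.name")
        for key in root.findall("g:key", GRAPHML_NS)
    }
    groups = defaultdict(list)
    for node in root.findall(".//g:node", GRAPHML_NS):
        attrs = {"id": node.get("id")}
        for data in node.findall("g:data", GRAPHML_NS):
            name = key_names.get(data.get("key"), data.get("key"))
            attrs[name] = data.text
        if "epochSecs" in attrs:
            attrs["epochSecs"] = int(attrs["epochSecs"])
        groups[attrs.get("type", "unknown")].append(attrs)
    return dict(groups)


groups = nodes_by_type(SAMPLE)
print(sorted(groups))
```

To read from a file instead of a string, ET.parse(path).getroot() could replace the ET.fromstring call.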

Senders and recipients of the message are captured by SENT and RECEIVED_BY relationships. The attributes of each relationship type are described below.

SENT Relationship
datetime – [String] The datetime associated with the message in UTC.
epochSecs – [Integer] The datetime represented as seconds from the epoch.

RECEIVED_BY Relationship
datetime – [String] The datetime associated with the message in UTC.
epochSecs – [Integer] The datetime represented as seconds from the epoch.
type – [String] “to” / “cc” / “bcc”
order – [Integer] The position of the recipient address in the list of addresses with the specified type.
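
The type and order attributes together should be enough to rebuild the original to, cc and bcc lists of a message. The sketch below assumes the RECEIVED_BY relationships of one message have already been loaded into plain dictionaries; the address field (taken from the target Email Address node), the sample values and the function name recipient_lists are assumptions made for the example, as is the starting value of order.

```python
from collections import defaultdict


def recipient_lists(received_by_edges):
    """Rebuild per-type recipient lists for one message, sorted by 'order'."""
    lists = defaultdict(list)
    # Sort by (type, order) so each list keeps the original header order.
    for edge in sorted(received_by_edges, key=lambda e: (e["type"], e["order"])):
        lists[edge["type"]].append(edge["address"])
    return dict(lists)


# Invented RECEIVED_BY relationships for a single message.
edges = [
    {"address": "carol@example.com", "type": "cc", "order": 1},
    {"address": "bob@example.com", "type": "to", "order": 2},
    {"address": "alice@example.com", "type": "to", "order": 1},
]
print(recipient_lists(edges))
```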

The datetime and epochSecs fields in the SENT and RECEIVED_BY relationships are identical to those of the Message nodes that they are connected to.
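
A minimal sketch of working with these two fields follows. It converts epochSecs to a timezone-aware UTC datetime and checks that a relationship carries the same timestamp as its Message node. The dictionaries and the datetime string format shown are assumptions for illustration; the format used in the actual file is not specified here.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_secs):
    """Convert seconds since the epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_secs, tz=timezone.utc)


def same_timestamp(message, relationship):
    """Check that a SENT or RECEIVED_BY relationship matches its Message node."""
    return (
        message["datetime"] == relationship["datetime"]
        and message["epochSecs"] == relationship["epochSecs"]
    )


# Invented values; the datetime string format is an assumption.
message = {"datetime": "2001-01-01 00:00:00", "epochSecs": 978307200}
sent = {"datetime": "2001-01-01 00:00:00", "epochSecs": 978307200}

print(epoch_to_utc(message["epochSecs"]).isoformat())
print(same_timestamp(message, sent))
```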

The employees for whom manager-subordinate relationship ground truth is available are represented by Person nodes. The attributes of a Person node are described below.

Person Node
lastName – [String] The individual’s last name.
firstName – [String] The individual’s first name.
provenance – [String] The source containing evidence of the individual’s existence.
type – [String] “Person”

Manager-subordinate social relationships between individuals are represented by DIRECTLY_REPORTED_TO relationships in GraphML. An individual’s use of an email address is captured by USED_EMAIL_ADDRESS relationships. The attributes of these relationships are described below.

DIRECTLY_REPORTED_TO Relationship
startDatetime – [String] Datetime for the beginning of the interval over which the relationship is known to exist.
endDatetime – [String] Datetime for the end of the interval over which the relationship is known to exist.
evidenceType – [String] “Interval”, meaning the evidence covers a span of time. Evidence derived from an email, by contrast, would often be point evidence, since most messages do not indicate the extent of the relationship.
provenance – [String] The source containing evidence of the relationship’s existence.
type – [String] “Social Relationship”

USED_EMAIL_ADDRESS Relationship
No attributes
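
Because each DIRECTLY_REPORTED_TO relationship carries an interval, a message timestamp can be matched against the relationships known to hold at that moment. The sketch below is one way to do this under stated assumptions: the datetime strings are taken to be ISO 8601 values in UTC, and the subordinate and manager fields, the sample records and the function name active_relationships are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(text):
    """Parse an ISO 8601 string and treat it as UTC (assumed format)."""
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


def active_relationships(relationships, epoch_secs):
    """Return the relationships whose interval contains the given instant."""
    when = datetime.fromtimestamp(epoch_secs, tz=timezone.utc)
    return [
        rel
        for rel in relationships
        if parse_utc(rel["startDatetime"]) <= when <= parse_utc(rel["endDatetime"])
    ]


# Invented records; person ids and dates are illustrative only.
relationships = [
    {"subordinate": "p1", "manager": "p2",
     "startDatetime": "2000-01-01T00:00:00", "endDatetime": "2000-12-31T23:59:59"},
    {"subordinate": "p1", "manager": "p3",
     "startDatetime": "2001-01-01T00:00:00", "endDatetime": "2001-11-30T23:59:59"},
]

# 978307200 is 2001-01-01 00:00:00 UTC.
for rel in active_relationships(relationships, 978307200):
    print(rel["subordinate"], "reported to", rel["manager"])
```

Both interval endpoints are treated as inclusive here; that choice is a design assumption and may need adjusting to match how the ground truth was recorded.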

Images depicting the entity-relationship diagram and the attributes associated with nodes and relationships are available at [5].

Dataset Version History

Version 0.12 – Corrected minor date error in date range for manager-subordinate relationships.
Version 0.11 – Removed duplicate Person nodes.
Version 0.10 – Initial release.

Contact Information
Questions, comments and suggestions are most welcome.
Contact Chris Diehl at diehl@alumni.cmu.edu.

References

[1] http://www.isi.edu/~adibi/Enron/Enron.htm
[2] http://bailando.sims.berkeley.edu/enron_email.html
[3] https://github.com/diehl/Enron-GraphML-Data-Documentation/blob/master/EnronManagerSubordinateRelationships.pdf
[4] C. P. Diehl, G. Namata, L. Getoor, “Relationship Identification for Social Network Discovery,” AAAI 2007
[5] https://github.com/diehl/Enron-GraphML-Data-Documentation

License

Public Domain (Data - Uncopyrightable)

There are some things that copyright law will not protect. Copyright will not protect the titles of a book or movie, nor will it protect short phrases such as “Make my day.” Copyright protection also doesn’t cover facts, ideas or theories. These things are free for all to use without authorization.
a. Short Phrases
Phrases such as “Show me the money” or “Beam me up” are not protected under copyright law. Short phrases, names, titles or small groups of words are considered common idioms of the English language and are free for anyone to use. In subsequent chapters we’ll explain how this rule applies to specific types of works. However, a short phrase used as an advertising slogan is protectible under trademark law. In that case, you could not use a similar phrase for the purpose of selling products or services.
b. Facts and Theories
A fact or a theory (for example, the fact that a comet will pass by the Earth in 2027) is not protected by copyright. If a scientist discovered this fact, anyone would be free to use it without asking for permission from the scientist. Similarly, if someone creates a theory that the comet can be destroyed by a nuclear device, anyone could use that theory to create a book or movie. However, the unique manner in which a fact is expressed may be protected. Therefore, if a filmmaker created a movie about destroying a comet with a nuclear device, the specific way he presented the ideas in the movie would be protected by copyright.
EXAMPLE: Neil Young wrote a song, “Ohio,” about the shooting of four college students during the Vietnam War. You are free to use the facts surrounding the shooting but you may not copy Mr. Young’s unique expression of these facts without his permission.
In some cases, you are not free to copy a collection of facts because the collection of facts may be protectible as a compilation (see Section B5). For more information on how copyright applies to facts, refer to Chapter 2, Section F3.
c. Ideas
Copyright law does not protect ideas; it only protects the particular way an idea is expressed. What’s the difference between an idea and its expression? In the case of a story or movie, the idea is really the plot in its most basic form. For example, the “idea” of the movie Contact is that a determined scientist, seeking to improve humankind, communicates with alien life forms. The same idea has been used in many motion pictures, books and television shows including The Day the Earth Stood Still, The Abyss and Star Trek. Many paintings, photographs and songs contain similar ideas. You can always use the underlying idea or theme, such as communicating with aliens for the improvement of the world, but you cannot copy the unique manner in which the author expresses the idea. This unique expression may include literary devices such as dialogue, characters and subplots.
d. U.S. Government Works
Any work created by a U.S. government employee or officer is in the public domain, provided that the work is created in that person’s official capacity. For example, during the 1980s a songwriter used words from a speech by then-President Ronald Reagan as the basis for song lyrics. The words from the speech were in the public domain and permission was not required from Ronald Reagan. Keep in mind that this rule applies only to works created by federal employees, and not to works created by state or local government employees. However, state and local laws and court decisions are in the public domain.
Some federal publications (or portions of them) are protected under copyright law and that fact is usually indicated on the title page or in the copyright notice. For example, the IRS may acquire permission to use a copyrighted chart in a federal tax booklet. The document may indicate that a certain chart is “Copyright Dr. Matt Polazzo.” In that case, you could not copy the chart without permission from Dr. Polazzo.