WO2015105498A1 - Auto completion of source code constructs

Info

Publication number: WO2015105498A1
Application number: PCT/US2014/010951
Authority: WO
Inventors: Ohad Assulin; Elad BENEDICT; Amit BEZALEL
Original assignee: Hewlett-Packard Development Company, L.P.
Priority date: 2014-01-10
Filing date: 2014-01-10
Publication date: 2015-07-16

Abstract

Disclosed herein are techniques for auto completion of source code constructs. Source code samples are used to differentiate between proper and improper source code constructs. Auto completion options are displayed such that the completions are ordered based at least partially on a likelihood that each option results in a proper construct.

Description

AUTO COMPLETION OF SOURCE CODE CONSTRUCTS

[0001] Computer programs may contain instructions that describe actions to be performed by a computer processor. A computer programmer may create the instructions ("source code") of a computer program. A programmer may edit a program's source code manually or may be assisted by an integrated development environment ("IDE").

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] Fig. 1 is a block diagram of an example system in accordance with aspects of the present disclosure.

[0003] Fig. 2 is a flow diagram of an example method in accordance with aspects of the present disclosure.

[0004] Fig. 3 is an example multidimensional space in accordance with aspects of the present disclosure.

[0005] Fig. 4 is a further example multidimensional space in accordance with aspects of the present disclosure.

DETAILED DESCRIPTION

[0006] As noted above, a programmer may manually edit a program's source code or may be assisted by an IDE. A programmer may use a typed programming language to implement the source code of a computer program. In a typed programming language, a variable's type may limit the operations applicable on that variable. By way of example, a variable of type String may contain the value "text." A compiler of a programming language may reject an attempt to divide a number by such a variable because its type is defined as a string of characters and not as an integer. Thus, types may make it easier for a compiler to validate language constraints on the source code.

[0007] A static data type may be determined when the program is compiled or before the program is executed. Static types may be explicitly defined or declared by a programmer. Many popular programming languages, such as C++, C# and Java, allow programmers to explicitly define static types that may be detected by a compiler. In contrast, dynamic data types may be determined at execution time. Therefore, dynamic data types may be associated with run-time values rather than predefined textual expressions. In this instance, a programmer is not required to explicitly define such types. However, type errors cannot be automatically detected until the program is executed. Some examples of dynamically typed languages include, but are not limited to, Lisp, Perl, Python, JavaScript, and Ruby.
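
By way of illustration only, the following Python sketch (Python being one of the dynamically typed languages listed above) shows a type error that surfaces only when the offending line is executed. The function name `ratio` and its arguments are hypothetical and do not appear in the disclosure.

```python
def ratio(total, divisor):
    # No types are declared; the operands are checked only when this line runs.
    return total / divisor


# Defining ratio() raises nothing, and a call with numbers succeeds.
print(ratio(10, 4))

# Dividing a number by a string is rejected only at execution time.
try:
    ratio(10, "text")
except TypeError as err:
    print("detected at run time:", err)
```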

[0008] IDEs often provide programmers with auto completion to make coding easier and more efficient. By way of example, a type Integer may have been predefined with two methods, add and subtract. A programmer may define a variable of type Integer. Upon typing the name of the variable followed by a period, a drop-down box may appear displaying the methods add and subtract, which allows the programmer to simply click on the method of choice. Upon clicking the method of choice, the IDE may insert the selected method into the code. Such auto completions may save a programmer time while coding. Unfortunately, auto completion of source code constructs based on dynamic types is often problematic, since the properties of these types are unknown until runtime. As dynamic types have become increasingly popular of late, auto completion may therefore become less effective at saving time while coding.

[0009] In view of the foregoing, disclosed herein are a system, non-transitory computer readable medium, and method for auto completion of source code constructs. In one example, source code samples may be used to differentiate between proper and improper source code constructs. In another example, auto completion options may be displayed such that the completions are ordered based at least partially on a likelihood that each option would result in a proper source code construct. The techniques disclosed herein may be used to provide auto completion options for dynamic types that are not predefined. Thus, the techniques disclosed herein may provide auto completion options to programmers even while coding dynamic types. The aspects, features and advantages of the present disclosure will be appreciated when considered with reference to the following description of examples and accompanying figures. The following description does not limit the application; rather, the scope of the disclosure is defined by the appended claims and equivalents.

[0010] FIG. 1 presents a schematic diagram of an illustrative computer apparatus 100 for executing the techniques disclosed herein. The computer apparatus 100 may include all the components normally used in connection with a computer. For example, it may have a keyboard and mouse and/or various other types of input devices such as pen-inputs, joysticks, buttons, touch screens, etc., as well as a display, which could include, for instance, a CRT, LCD, plasma screen monitor, TV, projector, etc. Computer apparatus 100 may also comprise a network interface (not shown) to communicate with other devices over a network. The computer apparatus 100 may also contain a processor 110, which may be any number of well-known processors, such as processors from Intel® Corporation. In another example, processor 110 may be an application specific integrated circuit ("ASIC"). Non-transitory computer readable medium ("CRM") 112 may store instructions that may be retrieved and executed by processor 110. In one example, the instructions may include a learning module 114 and a code completion module 116. Non-transitory CRM 112 may be used by or in connection with any instruction execution system that can fetch or obtain the logic therefrom and execute the instructions contained therein.

[0011] Non-transitory CRM 112 may comprise any one of many physical media such as, for example, electronic, magnetic, optical, electromagnetic, or semiconductor media. More specific examples of suitable non-transitory CRM include, but are not limited to, a portable magnetic computer diskette such as floppy diskettes or hard drives, a read-only memory ("ROM"), an erasable programmable read-only memory, a portable compact disc or other storage devices that may be coupled to computer apparatus 100 directly or indirectly. Alternatively, non-transitory CRM 112 may be a random access memory ("RAM") device or may be divided into multiple memory segments organized as dual in-line memory modules ("DIMMs"). The non-transitory CRM 112 may also include any combination of one or more of the foregoing and/or other devices as well. While only one processor and one non-transitory CRM are shown in FIG. 1, computer apparatus 100 may actually comprise additional processors and memories that may or may not be stored within the same physical housing or location.

[0012] The instructions residing in non-transitory CRM 112 may comprise any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by processor 110. In this regard, the terms "instructions," "scripts," and "applications" may be used interchangeably herein. The computer executable instructions may be stored in any computer language or format, such as in object code or modules of source code. Furthermore, it is understood that the instructions may be implemented in the form of hardware, software, or a combination of hardware and software and that the examples herein are merely illustrative.

[0013] As will be discussed in more detail below, learning module 114 may instruct processor 110 to differentiate between proper and improper source code constructs contained in source code samples. Furthermore, code completion module 116 may instruct processor 110 to display an ordered list of auto completion options when a source code prefix is detected such that the completions are ordered based on a likelihood that each completion results in a proper source code construct, as determined by the learning module, when appended to the prefix.

[0014] Working examples of the system, method, and non-transitory computer-readable medium are shown in FIGS. 2-4. In particular, FIG. 2 illustrates a flow diagram of an example method 200 for source code auto completion. FIG. 3 is an example of features of source code samples plotted in a multidimensional space and FIG. 4 is an example of features in previously used constructs plotted in the multidimensional space. The actions shown in FIGS. 3-4 will be discussed below with regard to the flow diagram of FIG. 2.

[0015] As shown in block 202 of FIG. 2, features of source code samples may be categorized to distinguish between proper and improper source code constructs. In one implementation, a team of researchers may determine which sample source code constructs are proper and improper. Such determination may be done visually or with the assistance of automated tools, such as dimensionality reduction algorithms (e.g., Kernel principal component analysis, multi-linear principal component analysis, etc.). In one example, a proper source code construct may be defined as a construct that would compile successfully and execute successfully at runtime. The features of the source code constructs may include, but are not limited to, number of characters in the construct, presence of special characters, or the presence of certain key words.
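
By way of illustration only, one hypothetical way to automate the labeling described above is to test whether a sample construct both compiles and evaluates without error. The sketch below assumes Python constructs and a made-up run-time object `t`; the helper name `is_proper` is not part of the disclosure, and because `eval` executes code, such a check would only be suitable for trusted samples.

```python
from types import SimpleNamespace


def is_proper(construct, namespace):
    """Return True if the construct compiles and also evaluates without error."""
    try:
        code = compile(construct, "<sample>", "eval")  # compile-time check
        eval(code, {"__builtins__": {}}, namespace)    # run-time check
    except Exception:
        return False
    return True


# Hypothetical run-time state: an object t with a single attribute.
namespace = {"t": SimpleNamespace(ssn_number="000-00-0000")}

# One valid construct, one misspelled attribute, one syntax error.
samples = ["t.ssn_number", "t.snn_number", "t..ssn_number"]
labels = {sample: is_proper(sample, namespace) for sample in samples}
print(labels)
```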

[0016] In another example, cross validation may be employed to determine which of the extracted features are most indicative of proper and improper source code constructs. Cross validation is a statistical technique for estimating the accuracy of a predictive model. Cross validation may filter out features that seem significant within the context of a limited data set, but are insignificant generally. Thus, cross validation prevents researchers from accepting that a feature is highly indicative of a proper source code construct generally based on a limited data set. One round of cross-validation may involve partitioning a sample of data into complementary subsets. One subset may be used as a training set and another set may be used to validate the analysis of the training set. Multiple rounds of cross-validation may be performed using different partitions and the validation results may be averaged over the multiple rounds.
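
By way of illustration only, the rounds of partitioning, training, validating, and averaging described above may be sketched as follows. The sample vectors, the second feature ("has digit"), and the single-feature threshold model are assumptions made for this sketch; the disclosure does not prescribe a particular model or data set.

```python
def k_fold_splits(n_samples, k):
    """Yield (training, validation) index lists for k complementary partitions."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation


def cross_validate(vectors, labels, fit, predict, k=3):
    """Return validation accuracy averaged over k rounds."""
    scores = []
    for training, validation in k_fold_splits(len(vectors), k):
        model = fit([vectors[i] for i in training], [labels[i] for i in training])
        hits = sum(predict(model, vectors[i]) == labels[i] for i in validation)
        scores.append(hits / len(validation))
    return sum(scores) / len(scores)


def fit_threshold(vectors, labels, feature):
    """Toy single-feature model: threshold halfway between the class means."""
    proper = [v[feature] for v, y in zip(vectors, labels) if y == 1]
    improper = [v[feature] for v, y in zip(vectors, labels) if y == -1]
    return feature, (sum(proper) / len(proper) + sum(improper) / len(improper)) / 2


def predict_threshold(model, vector):
    feature, threshold = model
    return 1 if vector[feature] < threshold else -1  # assumes smaller means proper


# Hypothetical samples <length, has digit>; +1 = proper, -1 = improper.
vectors = [(12, 1), (40, 0), (10, 0), (35, 1), (8, 1), (50, 0)]
labels = [1, -1, 1, -1, 1, -1]

# Score each feature on its own to see which one generalizes.
scores = {}
for feature in range(2):
    fit = lambda v, y, f=feature: fit_threshold(v, y, f)
    scores[feature] = cross_validate(vectors, labels, fit, predict_threshold)
print(scores)
```

In this toy data set the length feature validates well across all rounds while the second feature does not, which is the kind of contrast cross validation is meant to expose.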

[0017] Referring now to FIG. 3, a multidimensional space 300 is shown. For ease of illustration, only three dimensions are depicted, but it should be understood that many more dimensions may be used depending on the number of features detected. That is, each feature may be represented by a dimension in the graph. Learning module 114 may model the determination of proper and improper source code constructs as a binary classification problem. In one example, learning module 114 may comprise a support vector machine ("SVM") algorithm. An SVM algorithm is a binary classifier that may be employed to categorize new data into one of two classes (e.g., proper or improper source code constructs) based on a set of training samples. However, it is understood that other algorithms may be employed, such as, but not limited to, naive Bayes or neural networks.

[0018] As noted above, learning module 114 may be provided with a set of source code samples such that each sample is manually labeled as a proper or improper source code construct. Moreover, each sample submitted to learning module 114 may be accompanied by an associated vector and each value in the vector may correspond to one of the detected features. Learning module 114 may plot these features in an n-dimensional space such that n is equal to the number of detected features. Since the vectors are already labeled as proper and improper, learning module 114 may associate different patterns of vector values with one of the two categories. By way of example, there may be three features detected during analysis of the source code: number of characters, whether the construct contains a particular key word, and whether the construct contains a special character. Thus, a training source code construct of "t.ssn_number" may be represented by the vector <12, 0, 1>, wherein 12 is the number of characters in the construct, 0 indicates that the construct does not contain a key word, and 1 indicates that the construct does have a special character (in this example the special character is "_"). Learning module 114 may plot this vector in a three-dimensional space. As noted above, three dimensions are used in the examples herein for ease of illustration. For instance, if twelve features are detected, learning module 114 may plot such features in a twelve-dimensional space.
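
By way of illustration only, the three-feature vector described above may be computed as in the following sketch. The key word list and the set of special characters are hypothetical; the disclosure names no particular key words and identifies only "_" as a special character.

```python
import re

KEYWORDS = {"select", "delete", "eval"}  # hypothetical key words
SPECIAL_CHARACTERS = set("_$#@")         # hypothetical; the text only names "_"


def feature_vector(construct):
    """Map a construct to <length, has key word, has special character>."""
    tokens = re.findall(r"[A-Za-z]+", construct)
    has_keyword = int(any(token.lower() in KEYWORDS for token in tokens))
    has_special = int(any(ch in SPECIAL_CHARACTERS for ch in construct))
    return (len(construct), has_keyword, has_special)


print(feature_vector("t.ssn_number"))  # (12, 0, 1)
```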

[0019] Multidimensional space 300 shown in FIG. 3 may be generated by learning module 114 in accordance with three features. Each point plotted in cluster 304 may be associated with proper source code constructs and each point plotted in cluster 302 may be associated with improper source code constructs. Learning module 114 may identify a boundary 306 that differentiates the two classes of source code constructs. As will be discussed further below, boundary 306 may be a decision boundary that may be used to assess future source code constructs. Thus, one goal of learning module 114 may be to determine the line or hyperplane, out of all possible lines or hyperplanes, that best represents the boundary between proper and improper source code constructs. If a boundary cannot be found, learning module 114 may utilize statistical techniques, such as a Gaussian kernel, to rearrange the graph.
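
By way of illustration only, the search for a separating boundary may be sketched with a perceptron, which is used here as a simple stand-in for the SVM algorithm mentioned above. A perceptron finds some separating hyperplane for linearly separable samples, not necessarily the best one that an SVM would select. The training vectors are hypothetical, and the length feature is scaled by an assumed maximum of 50 characters so that all features have a comparable range.

```python
def train_boundary(vectors, labels, max_epochs=1000):
    """Perceptron: return (weights, bias) of a hyperplane separating the labels.

    Labels are +1 (proper) or -1 (improper). Training stops once every sample
    is classified correctly, or after max_epochs if no boundary is found.
    """
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(vectors, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # on the wrong side of, or on, the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def side(boundary, x):
    """Return +1 for the proper side of the boundary and -1 for the improper side."""
    weights, bias = boundary
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Hypothetical vectors <length / 50, has key word, has special character>.
proper = [(0.24, 0, 1), (0.20, 0, 1), (0.16, 0, 0)]
improper = [(0.80, 1, 1), (0.70, 1, 0), (1.00, 0, 1)]
vectors = proper + improper
labels = [1] * len(proper) + [-1] * len(improper)

boundary = train_boundary(vectors, labels)
print([side(boundary, x) for x in vectors])
```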

[0020] Referring back to FIG. 2, it may be determined whether a source code prefix is detected, as shown in block 204. After learning module 114 is trained, the resulting graph may be used to rank other source code constructs. In another example, real-time typing of source code may be monitored to detect a source code prefix. One popular source code format is "someObject.XXX". In this instance, upon detection of the prefix "someObject.", code completion module 116 may display an ordered list of auto completion options. The completions may be ranked or ordered based at least partially on an analysis of the multidimensional space and on a likelihood that each completion results in a proper source code construct when appended to the prefix, as shown in block 206 of FIG. 2. In another example, the auto completion options may include previously used completions of previously used constructs. In a further example, previously used completions may be completions used for constructs in a software project associated with the source code file in which the prefix was detected. In another aspect, features of the previously used constructs may be detected and plotted in the multidimensional space to determine the likelihood that each previously used completion would result in a proper construct when appended to the prefix. In turn, the programmer may select the desired previously used completion for the prefix.
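
By way of illustration only, prefix detection and the collection of previously used completions may be sketched as follows. The regular expressions, the function names, and the sample project text are assumptions made for this sketch; an actual IDE would more likely work from a parsed representation of the project than from raw text.

```python
import re

# Matches an identifier followed by "." and an optional partial member name at
# the end of the typed text, e.g. "x = someObject." or "someObject.na".
PREFIX_PATTERN = re.compile(r"([A-Za-z_]\w*)\.(\w*)$")


def detect_prefix(typed_text):
    """Return (object_name, partial_member), or None if no prefix is present."""
    match = PREFIX_PATTERN.search(typed_text)
    return match.groups() if match else None


def previously_used_completions(project_source, object_name, partial=""):
    """Collect members already used after object_name in the project source."""
    pattern = re.compile(r"\b" + re.escape(object_name) + r"\.([A-Za-z_]\w*)")
    seen = []
    for member in pattern.findall(project_source):
        if member.startswith(partial) and member not in seen:
            seen.append(member)
    return seen


# Hypothetical project text in which someObject has already been used.
project_source = """
someObject.name = "a"
print(someObject.ssn_number)
someObject.name_2 = someObject.name
"""

prefix = detect_prefix("total = someObject.")
print(prefix)
print(previously_used_completions(project_source, *prefix))
```

The candidates are returned in order of first use; ranking them by likelihood is a separate step.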

[0021] Referring now to FIG. 4, constructs that have been used by a programmer are shown being plotted in the multidimensional space. FIG. 4 shows points 402, 404, 406, 408, and 410 plotted in the multidimensional space. These example points are indicative of feature vectors associated with previously used constructs. Learning module 114 may determine, based on their features, on which side of boundary 306 to plot the constructs previously used by the programmer. As the distribution changes over time, learning module 114 may determine that a new boundary should be defined.

[0022] In one example, the likelihood that a completion of a previously used construct would result in a proper construct may be based on the distance between the detected features of the previously used constructs plotted in multidimensional space 300 and the boundary 306. As such, the further a vector associated with a given construct is plotted from boundary 306, the higher (on the "proper" side) or lower (on the "improper" side) the completion resulting in that construct is ranked in the auto complete list. In the example of FIG. 4, the completion resulting in the construct associated with point 406 may be ranked higher than any other, since it is the furthest from boundary 306 and it is on the "proper" side of boundary 306. The lowest ranked completion may be the completion resulting in the construct associated with point 408, since it is the furthest from boundary 306 and it is on the "improper" side of boundary 306.
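
By way of illustration only, the distance-based ranking may be sketched with the signed distance (w·x + b) / ||w|| from a feature vector x to a linear boundary w·x + b = 0, where positive values fall on the "proper" side. The boundary coefficients, completion names, and feature vectors below are hypothetical and are not taken from FIG. 4.

```python
import math


def signed_distance(boundary, vector):
    """Distance from vector to the hyperplane w.x + b = 0.

    Positive values lie on the "proper" side, negative on the "improper" side.
    """
    weights, bias = boundary
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, vector)) + bias) / norm


def rank_completions(boundary, candidates):
    """Order completions from most to least likely to yield a proper construct."""
    return sorted(candidates,
                  key=lambda name: signed_distance(boundary, candidates[name]),
                  reverse=True)


# Hypothetical boundary (weights, bias) and feature vectors for four completions.
boundary = ([-1.0, -1.0, 0.0], 1.0)
candidates = {
    "ssn_number": (0.24, 0, 1),
    "name": (0.10, 0, 0),
    "drop_table_now": (0.90, 1, 1),
    "x1_tmp": (0.60, 0, 1),
}
print(rank_completions(boundary, candidates))
```

Sorting by the signed value places the point furthest on the proper side first and the point furthest on the improper side last, mirroring the treatment of points 406 and 408.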

[0023] Advantageously, the foregoing system, method, and non-transitory computer readable medium provide a ranked auto completion list for source code constructs associated with dynamic types. In this regard, rather than displaying auto completions randomly, the completions may be ranked based on features of the resulting constructs as compared to features learned from source code samples. In turn, programmers may continue to code efficiently despite their use of dynamic types.

[0024] Although the disclosure herein has been described with reference to particular examples, it is to be understood that these examples are merely illustrative of the principles of the disclosure. It is therefore to be understood that numerous modifications may be made to the examples and that other arrangements may be devised without departing from the spirit and scope of the disclosure as defined by the appended claims. Furthermore, while particular processes are shown in a specific order in the appended drawings, such processes are not limited to any particular order unless such order is expressly set forth herein; rather, processes may be performed in a different order or concurrently and steps may be added or omitted.

Claims

1. A system comprising:

a learning module which upon execution instructs at least one processor to differentiate between proper and improper source code constructs contained in source code samples; and

a code completion module which upon execution instructs at least one processor to display an ordered list of auto completion options when a source code prefix is detected such that the completions are ordered based on a likelihood that each completion would result in a proper source code construct, as determined by the learning module, when appended to the prefix.

2. The system of claim 1, wherein the learning module upon execution instructs at least one processor to plot features of the source code samples in a multidimensional space and determine a boundary within the plotted features that differentiates between proper source code constructs and improper source code constructs.

3. The system of claim 2, wherein the learning module upon execution instructs at least one processor to determine the boundary using a support vector machine algorithm.

4. The system of claim 2, wherein the completion module upon execution further instructs at least one processor to:

include previously used completions of previously used constructs in the list of auto completion options;

detect features of each previously used construct; and

plot the detected features in the multidimensional space to determine the likelihood that each previously used completion would result in a proper source code construct when appended to the prefix.

5. The system of claim 4, wherein the likelihood is further based on a distance between the detected features of the previously used constructs plotted in the multidimensional space and the boundary within the plotted features that distinguishes between proper source code constructs and improper source code constructs.

6. A non-transitory computer readable medium having instructions therein which, if executed, cause a processor to:

plot features of source code samples in a multidimensional space;

differentiate between proper and improper source code constructs based on an analysis of the plotted features;

monitor real-time typing of source code to detect a source code prefix; and

display an ordered list of auto completion options when the prefix is detected such that the completions are ordered based at least partially on an analysis of the multidimensional space and a likelihood that each completion results in a proper source code construct when appended to the prefix.

7. The non-transitory computer readable medium of claim 6, wherein the instructions therein, if executed, further instruct at least one processor to determine a boundary within the plotted features in the multidimensional space to differentiate between proper source code constructs and improper source code constructs.

8. The non-transitory computer readable medium of claim 7, wherein the boundary is determined using a support vector machine algorithm.

9. The non-transitory computer readable medium of claim 7, wherein the instructions therein, if executed, further instruct at least one processor to:

detect features of each previously used construct; and

10. The non-transitory computer readable medium of claim 9, wherein the likelihood is further based on a distance between the detected features of the previously used constructs plotted in the multidimensional space and the boundary within the plotted features that distinguishes between proper source code constructs and improper source code constructs.

11. A method comprising:

plotting, using at least one processor, features of source code samples in a multidimensional space;

categorizing, using at least one processor, the plotted features as being indicative of proper source code constructs or improper source code constructs;

monitoring, using at least one processor, typing of source code to detect a source code prefix; and

if the prefix is detected, displaying, using at least one processor, a list of auto completion options that are ranked based at least partially on an analysis of the multidimensional space and on a likelihood that each completion results in a proper source code construct when appended to the prefix.

12. The method of claim 11, further comprising determining, using at least one processor, a boundary within the plotted features that differentiates between proper source code constructs and improper source code constructs.

13. The method of claim 12, wherein the boundary is determined using a support vector machine algorithm.

14. The method of claim 12, further comprising:

including, using at least one processor, previously used completions of previously used constructs in the list of auto completion options;

detecting, using at least one processor, features of each previously used construct; and

plotting, using at least one processor, the detected features in the multidimensional space to determine the likelihood that each previously used completion would result in a proper source code construct when appended to the prefix.

15. The method of claim 12, wherein the likelihood is further based on a distance between the detected features of the previously used constructs plotted in the multidimensional space and the boundary within the plotted features that distinguishes between proper source code constructs and improper source code constructs.