Commit 3499750

Sync to linguist 7.2.0: heuristics.yml support (#189)
Sync with GitHub Linguist v7.2.0. Includes the new way of handling `heuristics.yml`, and all `./data/*` files re-generated using the GitHub Linguist [v7.2.0](https://github.com/github/linguist/releases/tag/v7.2.0) release tag.

- many new languages
- better vendoring detection
- updated docs on updating and known issues
1 parent 13d3d66 commit 3499750

45 files changed: +114957 -84118 lines changed

CONTRIBUTING.md

+61
@@ -0,0 +1,61 @@
+# source{d} Contributing Guidelines
+
+source{d} projects accept contributions via GitHub pull requests.
+This document outlines some of the
+conventions on development workflow, commit message formatting, contact points,
+and other resources to make it easier to get your contribution accepted.
+
+## Certificate of Origin
+
+By contributing to this project, you agree to the [Developer Certificate of
+Origin (DCO)](DCO). This document was created by the Linux Kernel community and is a
+simple statement that you, as a contributor, have the legal right to make the
+contribution.
+
+In order to show your agreement with the DCO you should include at the end of the commit message,
+the following line: `Signed-off-by: John Doe <[email protected]>`, using your real name.
+
+This can be done easily using the [`-s`](https://github.com/git/git/blob/b2c150d3aa82f6583b9aadfecc5f8fa1c74aca09/Documentation/git-commit.txt#L154-L161) flag on `git commit`.
+
+If you find that you have pushed a few commits without `Signed-off-by`, you can still add it afterwards. We wrote a manual which can help: [fix-DCO.md](https://github.com/src-d/guide/blob/master/developer-community/fix-DCO.md).
+
+## Support Channels
+
+The official support channels, for both users and contributors, are:
+
+- GitHub issues: each repository has its own list of issues.
+- Slack: join the [source{d} Slack](https://join.slack.com/t/sourced-community/shared_invite/enQtMjc4Njk5MzEyNzM2LTFjNzY4NjEwZGEwMzRiNTM4MzRlMzQ4MmIzZjkwZmZlM2NjODUxZmJjNDI1OTcxNDAyMmZlNmFjODZlNTg0YWM) community.
+
+*Before opening a new issue or submitting a new pull request, it's helpful to
+search the project - it's likely that another user has already reported the
+issue you're facing, or it's a known issue that we're already aware of.*
+
+
+## How to Contribute
+
+Pull Requests (PRs) are the main and exclusive way to contribute code to source{d} projects.
+In order for a PR to be accepted it needs to pass this list of requirements:
+
+- The contribution must be correctly explained in natural language and provide a minimum working example that reproduces it.
+- All PRs must be written idiomatically:
+    - for Go: formatted according to [gofmt](https://golang.org/cmd/gofmt/), and without any warnings from [go lint](https://github.com/golang/lint) nor [go vet](https://golang.org/cmd/vet/)
+    - for other languages, similar constraints apply.
+- They should in general include tests, and those shall pass.
+    - If the PR is a bug fix, it has to include a new unit test that fails before the patch is merged.
+    - If the PR is a new feature, it has to come with a suite of unit tests that test the new functionality.
+- In any case, all the PRs have to pass the personal evaluation of at least one of the [maintainers](MAINTAINERS) of the project.
+
+
+### Format of the commit message
+
+Every commit message should describe what was changed, under which context and, if applicable, the GitHub issue it relates to:
+
+```
+plumbing: packp, Skip argument validations for unknown capabilities. Fixes #623
+```
+
+The format can be described more formally as follows:
+
+```
+<package>: <subpackage>, <what changed>. [Fixes #<issue-number>]
+```
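
The commit message format described in the new CONTRIBUTING.md can be checked mechanically. The following is a minimal sketch in Go; the regular expression and the `validCommitSubject` helper are illustrative assumptions, not part of any source{d} tooling, and the pattern treats only the trailing `Fixes #N` as optional, as the brackets in the format suggest.

```go
package main

import (
	"fmt"
	"regexp"
)

// commitSubject is a hypothetical pattern for
// "<package>: <subpackage>, <what changed>. [Fixes #<issue-number>]".
// Only the trailing "Fixes #N" part is optional (an assumption).
var commitSubject = regexp.MustCompile(`^[\w./-]+: [\w./-]+, .+\.( Fixes #\d+)?$`)

// validCommitSubject reports whether s follows the sketched format.
func validCommitSubject(s string) bool {
	return commitSubject.MatchString(s)
}

func main() {
	fmt.Println(validCommitSubject("plumbing: packp, Skip argument validations for unknown capabilities. Fixes #623"))
	fmt.Println(validCommitSubject("fixed stuff"))
}
```

A check like this would probably fit best in a commit hook or CI step, where a failure can point the contributor back to the guidelines.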

Makefile

+5
@@ -46,6 +46,11 @@ clean: clean-linguist clean-shared
 code-generate: $(LINGUIST_PATH)
 	mkdir -p data && \
 	go run internal/code-generator/main.go
+	ENRY_TEST_REPO="$${PWD}/.linguist" go test -v \
+		-run Test_GeneratorTestSuite \
+		./internal/code-generator/generator \
+		-testify.m TestUpdateGeneratorTestSuiteGold \
+		-update_gold
 
 benchmarks: $(LINGUIST_PATH)
 	go test -run=NONE -bench=. && \

README.md

+9-12
@@ -154,14 +154,17 @@ Generated Java bindings using a C-shared library and JNI are located under [`jav
 Development
 ------------
 
-*enry* re-uses parts of original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate the necessary code you must run:
+*enry* re-uses parts of original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate all the necessary code you must run:
 
+    git clone https://github.com/github/linguist.git .linguist
+    # update commit in generator_test.go (to re-generate .gold fixtures)
+    # https://github.com/src-d/enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18
     go generate
 
 We update enry when changes are done in linguist's master branch on the following files:
 
 * [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
-* [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb)
+* [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
 * [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
 * [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)
 
@@ -183,17 +186,11 @@ Divergences from linguist
 Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
 as a set for the tests, the following issues were found:
 
-* With [hello.ms](https://github.com/github/linguist/blob/master/samples/Unix%20Assembly/hello.ms) we can't detect the language (Unix Assembly) because we don't have a matcher in contentMatchers (content.go) for Unix Assembly. Linguist uses this [regexp](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L300) in its code,
+* [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed due to an unsupported backreference in the RE2 regexp engine
 
-`elsif /(?<!\S)\.(include|globa?l)\s/.match(data) || /(?<!\/\*)(\A|\n)\s*\.[A-Za-z][_A-Za-z0-9]*:/.match(data.gsub(/"([^\\"]|\\.)*"|'([^\\']|\\.)*'|\\\s*(?:--.*)?\n/, ""))`
+* As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), Linguist uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) regex-based algorithm. Tracked under https://github.com/src-d/enry/issues/193
 
-which we can't port.
-
-* All files for the SQL language fall to the classifier because we don't parse
-this [disambiguator
-expression](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L433)
-for `*.sql` files right. This expression doesn't comply with the pattern for the
-rest in [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb).
+* Bayesian classifier cannot distinguish "SQL" vs "PLpgSQL". Tracked under https://github.com/src-d/enry/issues/194
 
 `enry` [CLI tool](#cli) does not require a full Git repository to be present in filesystem in order to report languages.
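
The `.es` divergence listed in this hunk comes from a property of Go's `regexp` package: it implements RE2 syntax, which has no backreferences, so a pattern containing `\1` fails to compile. The sketch below shows that behavior. The patterns are made-up illustrations, not the actual Linguist heuristic, and the `compiles` helper is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's RE2-based regexp engine accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// A backreference (\1) is not valid RE2 syntax, so compilation fails.
	fmt.Println(compiles(`(['"])use strict\1`))
	// An RE2-compatible approximation that no longer requires matching quotes.
	fmt.Println(compiles(`['"]use strict['"]`))
}
```

The approximation is looser than the original (it accepts mismatched quotes), which is the kind of trade-off any port of such a rule to RE2 would have to weigh.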

@@ -232,7 +229,7 @@ As benchmarks depend on Ruby and Github-Linguist gem make sure you have:
 If you want to reproduce the same benchmarks as reported above:
 - Make sure all [dependencies](#benchmark-dependencies) are installed
 - Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
-- Run `ENRY_TEST_REPO=.linguist benchmarks/run.sh` (takes ~15h)
+- Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)
 
 It will run the benchmarks for enry and linguist, parse the output, create csv files and plot the histogram. This takes some time.

benchmark_test.go

+1-4
@@ -28,9 +28,6 @@ var (
 )
 
 func TestMain(m *testing.M) {
-	var exitCode int
-	defer os.Exit(exitCode)
-
 	flag.BoolVar(&slow, "slow", false, "run benchmarks per sample for strategies too")
 	flag.Parse()
 
@@ -47,7 +44,7 @@ func TestMain(m *testing.M) {
 		log.Fatal(err)
 	}
 
-	exitCode = m.Run()
+	os.Exit(m.Run())
 }
 
 func cloneLinguist(linguistURL string) error {

common.go

+5-6
@@ -16,7 +16,7 @@ const OtherLanguage = ""
 // Strategy type fix the signature for the functions that can be used as a strategy.
 type Strategy func(filename string, content []byte, candidates []string) (languages []string)
 
-// DefaultStrategies is the strategies' sequence GetLanguage uses to detect languages.
+// DefaultStrategies is a sequence of strategies used by GetLanguage to detect languages.
 var DefaultStrategies = []Strategy{
 	GetLanguagesByModeline,
 	GetLanguagesByFilename,
@@ -397,12 +397,13 @@ func GetLanguagesByContent(filename string, content []byte, _ []string) []string
 	}
 
 	ext := strings.ToLower(filepath.Ext(filename))
-	fnMatcher, ok := data.ContentMatchers[ext]
+
+	heuristic, ok := data.ContentHeuristics[ext]
 	if !ok {
 		return nil
 	}
 
-	return fnMatcher(content)
+	return heuristic.Match(content)
 }
 
 // GetLanguagesByClassifier uses DefaultClassifier as a Classifier and returns a sorted slice of possible languages ordered by
@@ -455,9 +456,7 @@ func GetLanguageType(language string) (langType Type) {
 // GetLanguageByAlias returns either the language related to the given alias and ok set to true
 // or Otherlanguage and ok set to false if the alias is not recognized.
 func GetLanguageByAlias(alias string) (lang string, ok bool) {
-	a := strings.Split(alias, `,`)[0]
-	a = strings.ToLower(a)
-	lang, ok = data.LanguagesByAlias[a]
+	lang, ok = data.LanguageByAlias(alias)
 	if !ok {
 		lang = OtherLanguage
 	}
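
The three removed lines show the normalization that used to happen inline: keep the text before the first comma, lower-case it, then look it up. A sketch of a helper doing exactly that is below. It assumes `data.LanguageByAlias` wraps comparable logic, which this diff does not show, and the alias table is a tiny made-up sample rather than enry's generated data.

```go
package main

import (
	"fmt"
	"strings"
)

// languagesByAlias is a small illustrative table, not enry's generated data.
var languagesByAlias = map[string]string{
	"golang": "Go",
	"go":     "Go",
	"c++":    "C++",
	"cpp":    "C++",
}

// languageByAlias keeps only the text before the first comma and lower-cases
// it before the lookup, mirroring the removed lines.
func languageByAlias(alias string) (string, bool) {
	key := strings.ToLower(strings.Split(alias, ",")[0])
	lang, ok := languagesByAlias[key]
	return lang, ok
}

func main() {
	fmt.Println(languageByAlias("Golang,extra"))
	fmt.Println(languageByAlias("klingon"))
}
```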

common_test.go

+51-24
@@ -11,6 +11,7 @@ import (
 	"gopkg.in/src-d/enry.v1/data"
 
 	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
 	"github.com/stretchr/testify/suite"
 )
 
@@ -19,9 +20,36 @@ const linguistClonedEnvVar = "ENRY_TEST_REPO"
 
 type EnryTestSuite struct {
 	suite.Suite
-	repoLinguist string
-	samplesDir   string
-	cloned       bool
+	tmpLinguist string
+	needToClone bool
+	samplesDir  string
+}
+
+func (s *EnryTestSuite) TestRegexpEdgeCases() {
+	var regexpEdgeCases = []struct {
+		lang     string
+		filename string
+	}{
+		{lang: "ActionScript", filename: "FooBar.as"},
+		{lang: "Forth", filename: "asm.fr"},
+		{lang: "X PixMap", filename: "cc-public_domain_mark_white.pm"},
+		//{lang: "SQL", filename: "drop_stuff.sql"}, // https://github.com/src-d/enry/issues/194
+		{lang: "Fstar", filename: "Hacl.Spec.Bignum.Fmul.fst"},
+		{lang: "C++", filename: "Types.h"},
+	}
+
+	for _, r := range regexpEdgeCases {
+		filename := fmt.Sprintf("%s/samples/%s/%s", s.tmpLinguist, r.lang, r.filename)
+
+		content, err := ioutil.ReadFile(filename)
+		require.NoError(s.T(), err)
+
+		lang := GetLanguage(r.filename, content)
+		s.T().Logf("File:%s, lang:%s", filename, lang)
+
+		expLang, _ := data.LanguageByAlias(r.lang)
+		require.EqualValues(s.T(), expLang, lang)
+	}
 }
 
 func Test_EnryTestSuite(t *testing.T) {
@@ -30,25 +58,24 @@ func Test_EnryTestSuite(t *testing.T) {
 
 func (s *EnryTestSuite) SetupSuite() {
 	var err error
-	s.repoLinguist = os.Getenv(linguistClonedEnvVar)
-	s.cloned = s.repoLinguist == ""
-	if s.cloned {
-		s.repoLinguist, err = ioutil.TempDir("", "linguist-")
-		assert.NoError(s.T(), err)
-	}
-
-	s.samplesDir = filepath.Join(s.repoLinguist, "samples")
-
-	if s.cloned {
-		cmd := exec.Command("git", "clone", linguistURL, s.repoLinguist)
+	s.tmpLinguist = os.Getenv(linguistClonedEnvVar)
+	s.needToClone = s.tmpLinguist == ""
+	if s.needToClone {
+		s.tmpLinguist, err = ioutil.TempDir("", "linguist-")
+		require.NoError(s.T(), err)
+		s.T().Logf("Cloning Linguist repo to '%s' as %s was not set\n",
+			s.tmpLinguist, linguistClonedEnvVar)
+		cmd := exec.Command("git", "clone", linguistURL, s.tmpLinguist)
 		err = cmd.Run()
-		assert.NoError(s.T(), err)
+		require.NoError(s.T(), err)
 	}
+	s.samplesDir = filepath.Join(s.tmpLinguist, "samples")
+	s.T().Logf("using samples from %s", s.samplesDir)
 
 	cwd, err := os.Getwd()
 	assert.NoError(s.T(), err)
 
-	err = os.Chdir(s.repoLinguist)
+	err = os.Chdir(s.tmpLinguist)
 	assert.NoError(s.T(), err)
 
 	cmd := exec.Command("git", "checkout", data.LinguistCommit)
@@ -60,8 +87,8 @@ func (s *EnryTestSuite) SetupSuite() {
 }
 
 func (s *EnryTestSuite) TearDownSuite() {
-	if s.cloned {
-		err := os.RemoveAll(s.repoLinguist)
+	if s.needToClone {
+		err := os.RemoveAll(s.tmpLinguist)
 		assert.NoError(s.T(), err)
 	}
 }
@@ -88,7 +115,7 @@ func (s *EnryTestSuite) TestGetLanguage() {
 }
 
 func (s *EnryTestSuite) TestGetLanguagesByModelineLinguist() {
-	var modelinesDir = filepath.Join(s.repoLinguist, "test/fixtures/Data/Modelines")
+	var modelinesDir = filepath.Join(s.tmpLinguist, "test/fixtures/Data/Modelines")
 
 	tests := []struct {
 		name string
@@ -400,15 +427,16 @@ func (s *EnryTestSuite) TestGetLanguageByAlias() {
 func (s *EnryTestSuite) TestLinguistCorpus() {
 	const filenamesDir = "filenames"
 	var cornerCases = map[string]bool{
-		"hello.ms": true,
+		"drop_stuff.sql": true, // https://github.com/src-d/enry/issues/194
+		// .es and .ice fail heuristics parsing, but do not fail any tests
 	}
 
 	var total, failed, ok, other int
 	var expected string
 	filepath.Walk(s.samplesDir, func(path string, f os.FileInfo, err error) error {
 		if f.IsDir() {
 			if f.Name() != filenamesDir {
-				expected = f.Name()
+				expected, _ = data.LanguageByAlias(f.Name())
 			}
 
 			return nil
@@ -431,17 +459,16 @@ func (s *EnryTestSuite) TestLinguistCorpus() {
 		} else {
 			status = "failed"
 			failed++
-
 		}
 
 		if _, ok := cornerCases[filename]; ok {
-			fmt.Printf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
+			s.T().Logf("\t\t[considered corner case] %s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status)
 		} else {
 			assert.Equal(s.T(), expected, obtained, fmt.Sprintf("%s\texpected: %s\tobtained: %s\tstatus: %s\n", filename, expected, obtained, status))
 		}
 
 		return nil
 	})
 
-	fmt.Printf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
+	s.T().Logf("\t\ttotal files: %d, ok: %d, failed: %d, other: %d\n", total, ok, failed, other)
 }
